It all started when I wanted my own wakeword for Mycroft. I followed the recommendations for creating a wakeword model with Mycroft Precise. It didn't work very well. It kept waking up falsely, and it didn't work for my wife at all, even though I had collected a few recordings of her saying the wakeword.

This awoke the inner data scientist in me. How could I create a production quality wakeword model without making thousands and thousands of recordings?

I realized that if I was having this problem, so were many other people. No wonder wakeword systems aren't more ubiquitous. It also made me wonder what other blockers in NLP (Natural Language Processing) were stopping us from having awesome FOSS voice assistants for all the self hosters out there.

It turns out there are a lot of problems and a whole community of FOSS voice assistant developers facing those same problems. Our only path to success was to unite as a community.

Secret Sauce AI

🔍 Secret Sauce AI is a coordinated community of AI enthusiasts. We have come together as many individuals and projects in the FOSS voice assistant space to solve big AI problems for everyone out there.

We are focused on many areas of AI (especially in NLP), but our 🔎 first project is in the area of wakewords.

Cool, but what's a wakeword exactly?

When you use a voice assistant, you usually start by waking it (e.g. 'hey Mycroft' or 'hey Siri'). The wakeword is detected by a binary acoustic model ('wakeword' vs. 'not-wakeword' classes) that triggers ASR (automated speech recognition) transcription when the wakeword is uttered. This is generally how all voice assistants work.
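That triggering step can be sketched as a tiny binary classifier run over a sliding window of audio, firing only on a sustained run of confident predictions. This is just a minimal illustration of the idea; the threshold and streak values below are invented for the example and aren't Precise's actual defaults:

```python
def wakeword_trigger(probabilities, threshold=0.8, consecutive=3):
    """Fire only when the model reports `consecutive` high-probability
    windows in a row -- a simple way to suppress one-off spikes."""
    streak = 0
    for p in probabilities:
        streak = streak + 1 if p >= threshold else 0
        if streak >= consecutive:
            return True
    return False

# A lone confident window shouldn't wake the assistant...
assert not wakeword_trigger([0.1, 0.95, 0.2, 0.1])
# ...but a sustained run of confident windows should.
assert wakeword_trigger([0.3, 0.9, 0.92, 0.97, 0.4])
```

Real engines add more machinery (cooldown periods, per-wakeword sensitivity), but the yes/no-over-a-stream shape is the core of it.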

Yeah, that's nice and all but there are already wakewords out there...

True, but making your own customized production quality wakeword can be rather difficult (like impossible) using FOSS solutions, as I personally found out. And it shouldn't be this hard to make your own wakeword!

Let's break down the problems and the solutions.

1. Data collection

Problem

How much data do you need, what kinds of data, and how do you go about collecting it? There really isn't much exact information out there, and big companies usually collect thousands to millions of samples to make their production quality wakewords. That is a bit beyond the average self hoster's resources.

Solution

So the solution was to experimentally figure out a data collection recipe, keeping the data as sparse as possible while making sure it still produced a production quality wakeword. That's a tall order, but we worked on this (way longer than we want to admit).

We have released a prototype 📦 Wakeword Data Collector in Python that runs a user through the collection process.
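To give a feel for what such a collection session looks like, here is a sketch of a session plan: a handful of wakeword recordings plus a larger batch of 'not-wakeword' samples. The counts, prompts, and directory names below are purely illustrative, not the actual recipe the Wakeword Data Collector uses:

```python
import os

def collection_plan(wakeword, out_dir, n_wakeword=12, n_not_wakeword=32):
    """Sketch of a data collection session: pair each prompt shown to
    the user with the target file path for that recording.
    All counts and paths here are illustrative only."""
    plan = []
    for i in range(n_wakeword):
        plan.append((f"say '{wakeword}'",
                     os.path.join(out_dir, "wake-word", f"sample-{i:02d}.wav")))
    for i in range(n_not_wakeword):
        plan.append(("record background noise / other speech",
                     os.path.join(out_dir, "not-wake-word", f"noise-{i:02d}.wav")))
    return plan

session = collection_plan("hey jarvis", "data")
# 12 wakeword prompts + 32 not-wakeword prompts = 44 recordings total
```

The important point the recipe encodes is the class balance: you need noticeably more 'not-wakeword' material than wakeword utterances, since false activations are what ruin the experience.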

2. How do you make a production quality model with machine learning and all of that stuff?

Problem

It can be hard to hit on a winning recipe to train a production quality model, especially if you aren't doing this professionally as a data scientist. There isn't so much information out there on the exact recipe to do this.

Solution

We experimentally figured out the best recipe while keeping the data sparse and made the 📦 Precise Wakeword Model Maker to do it all for you automatically.

It uses Mycroft's Precise engine to train a model for you. It pulls out every AI trick we know to get a higher quality model from sparse data, using new groundbreaking techniques: from TTS engines that generate more data, to incremental and curriculum learning methods that improve the training and testing scores, and much more.
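The TTS trick is easy to picture: sweep a few synthesis parameters across several voices to get many extra wakeword samples from a single phrase. The voice names and parameter ranges below are made up for illustration, and the actual Model Maker pipeline may do this quite differently:

```python
from itertools import product

def tts_augmentation_jobs(wakeword, voices, rates=(0.85, 1.0, 1.15),
                          pitches=(-2, 0, 2)):
    """Enumerate (voice, rate, pitch) combinations to synthesize.
    Each job would be handed to a local TTS engine to produce one
    extra synthetic wakeword training sample."""
    return [{"text": wakeword, "voice": v, "rate": r, "pitch": p}
            for v, r, p in product(voices, rates, pitches)]

jobs = tts_augmentation_jobs("hey mycroft", ["voice_a", "voice_b"])
# 2 voices x 3 rates x 3 pitches = 18 synthetic samples to generate
```

Even this toy grid shows why the technique matters: one recorded phrase fans out into dozens of acoustically varied training examples for free.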

Help!

We would love to have more help working on AI based projects for the self hosting community, as we are just some people who know each other mostly from Reddit and GitHub and do this in our free time. Please feel free to DM me if you want to help out.

Future projects

Wakeword

This release represents the first phase of the wakeword project. We are working on a 📦 Rust wakeword engine based on Precise and a 📦 SpeechPy MFCC port in Rust, so that users can run the wakeword easily on their phones and other devices. It is hard to believe that there is currently no good solution for running a modern FOSS wakeword engine on a phone in real time. We want to change that and give everyone access to this technology, with their own wakeword of choice.

We would also love to improve upon our current prototype releases and welcome feedback and especially help from the community.

NLU-NLG: Natural Language Understanding, Natural Language Generation

This will be the next project we focus on. We will benchmark current solutions, improve general data sets, and publish information to help everyone improve upon their current NLU-NLG use cases. All of this is still a heavy work in progress.

Lots more to come in NLU-NLG, so stay tuned.

Voice Assistant Bus Protocol

We are working on a universal 📦 Voice Assistant Protocol (VAP).

Generally, we are working on many more projects, but it's still too early to speak about them with any detail. If you are curious and want to know more, you can always write me. Once again, we would love any help offered. There are a lot of big AI problems to solve out there and we are just some random passionate folks, not some fancy company or anything.

Member projects

A lot of our Secret Sauce AI members build FOSS voice assistant software. It is always worth checking their software out. We just love this community!

We would love to give a shout out to the folks over at Mycroft; if it weren't for them, we wouldn't have a modern FOSS wakeword engine or a lot of other things in the FOSS voice assistant community.

A BIG what's up to the Secret Sauce AI members out there on Reddit:

And last but not least, a special greeting to Dan "The Man" Borufka who's been down with Secret Sauce AI since day one. Thanks Dan!

Comments

tl;dr Secret Sauce AI made some FOSS tools to make your own custom wakewords easily:

and we are working on other FOSS voice assistant related problems in AI to share with the community.

Thanks for the shoutout u/Bartmoss! Keyword detection has definitely been the biggest hurdle in my voice assistant project. I will be implementing the Secret Sauce keyword detection on GLaDOS Voice Assistant soon when I get the time, and report back on the process & issues that arise along the way.

Keep up the good work!

Absolutely fantastic, this was the missing piece for me to tinker with smart devices. I needed a central one that controls the others, kind of a hub that I control with a custom wake word. Once activated, it would route custom audio to the ones connected to the internet (alexa & co) that have their microphones physically deactivated.

Great stuff, thank you for this! I'm gonna try it out and report.

That's great to hear that it is useful for you. If you need any help or support, don't be afraid to write me; I'd be more than happy to help, and feedback is very welcome. We do this for the community! 😀

I'm super curious now about something. What makes a wake word different than all the other words I yell at my alexa or google across the room which it picks up just fine?

The wakeword is "Alexa" or "hey Google". It's the thing you generally say before commands, the phrase the assistant is trained to listen for. The goal of a wakeword is to (hopefully) be well trained enough that it only activates when you say it (when you are addressing the assistant in question), and activates every time you say it. Achieving both of those, or even just one, is something of a high bar, and is a common blocker for people working on their own assistant apps.

Sorry, I think I should have worded my question differently but you still answered it. I knew what a wake word was but why it mattered didn't click immediately. Thank you:)

A wakeword detector is reeeeally good at detecting a single phrase in a stream of audio and giving a simple yes-or-no answer, whereas speech-to-text is more general. They have kinda similar goals, yes, but the algorithm is different.

I am currently using speech-to-text as a trigger word detector in my voice assistant. It is computationally expensive to continuously run speech detection and look for a specific word, and this process is much more prone to errors.

This project will hopefully allow me to move to a discrete wake word detector model that will be listening for "hey glados" and then hand the mic to the speech-to-text algorithm.
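That hand-off, a cheap always-on detector that only invokes the expensive speech-to-text step when it fires, can be sketched like this. The detector and transcriber below are stubs standing in for real components:

```python
def run_pipeline(audio_chunks, detect_wakeword, transcribe):
    """Feed every chunk to the cheap wakeword detector; only the
    chunk after a trigger is sent to the expensive STT step."""
    commands = []
    awake = False
    for chunk in audio_chunks:
        if awake:
            commands.append(transcribe(chunk))
            awake = False          # go back to sleep after one command
        elif detect_wakeword(chunk):
            awake = True
    return commands

# Stub detector/transcriber for illustration:
heard = run_pipeline(
    ["noise", "hey glados", "turn on the lights", "noise"],
    detect_wakeword=lambda c: c == "hey glados",
    transcribe=lambda c: c.upper(),
)
# heard == ["TURN ON THE LIGHTS"]
```

The point is that `transcribe` (the heavy model, possibly on a server) is only ever called on the small slice of audio that follows a wake event.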

That makes perfect sense, thank you.

The wake word has to be processed locally on the assistant hardware with minimal processing power. Once the assistant wakes, it sends all further voice to the cloud, where they have datacentres full of servers doing the voice recognition, so it is much easier to do that accurately.

Or you can self host your own ASR like DeepSpeech.

https://github.com/mozilla/DeepSpeech

In short.... the precision needed, and the knowledge of whether it should send audio to the cloud.

Bottom line is you are probably going to say a ton of random meaningless crap in your device's voice range. Some of which might be close enough to commands that it would execute them. That's the big thing... when you say

Hey alexa

Turn on the living room light.

It knows after hey alexa... that what you are saying is definitely a command, and it will match the nearest sounding things.

but before hey alexa... it knows 99.9% of words spoken aren't directed at it. If it's too sensitive then it turns itself on randomly.

This is awesome! Thank you for all your work on this

Considering that Google themselves seem to be having huge problems with wake word identification, despite their server clusters, tensor flows and AI horsepower, if someone else could think outside the box and somehow solve what Google cannot, it would be much welcome. No matter what sensitivity I set my Google speakers to, they trigger on the Google wake words from almost any sound at least a couple of times per day. 😁

I think this is a great point to raise and one I would personally like to address. Sorry for the rant but I personally feel very passionate about this subject.

I don't think it is really fair to compare what Google and other companies do when making "general" models (or at least as you have pointed out, trying to make general models) and the philosophy and methodology behind Secret Sauce AI and here's why:

The business as usual for most companies working in AI is to invest a lot of money in initially collecting tons of data, then use that to make models that appear to be more or less "static": users don't notice them improving directly from interactions, but they seem to work well enough in QA. In your case, for example, if the wakeword fails, it will keep failing for you.

The reason this is done this way is simply money. As everyone knows, companies make money off of offering "free" services where they collect lots of data from users that goes beyond just improving the actual service. Now I personally don't think there is anything wrong with that, if users feel like the value of the app, website, or IT service otherwise is worth the agreement to collect their data, that's cool (but I do wish the average user had a better understanding of the data being collected and its value!).

Now a bunch of random people from the internet could never compete with the resources of Google or any other company that does stuff in AI. So we flip the script and switch the game up.

We want to empower users to collect their own data and make their own models to use for their website, app, or IT service otherwise. This means, the model is built just for you and you own the data and the model, too. It can work much better than any general model a big company could make, and it can improve as you collect more of your own data.

To me and the Secret Sauce AI community, this is real AI. AI that actually learns from you directly. But at the same time there is a lot of respect for the user in terms of their data privacy, as we don't collect any data. The tools, methods, you name it, are all totally FOSS.

I personally see this sub and similar communities of self hosters as the spearhead of this movement. The people here are all about running their own IT services and they are way more willing to put up with lots of steps to get stuff working (and keep it working!), where the average person probably doesn't have the patience for this. But I think this technology can get easier and can trickle down into the mainstream. I really hope it does!

So yes, with the wakeword stuff we put up, you will for sure get better results than Google's wakeword, and it can be pretty much any wakeword you want, but I am not sure if these companies will switch their business models because of this.

However, we will continue to work on this new way to do AI and keep giving it to you, open and free. We call it TinyML+ and you can read more about it here.

How is a wakeword any different from any other text generated by speech-to-text? Don't all words have to be translated to text first in order to be tested against the string that is the user's preferred wake word?

The requirements on a wakeword are much stronger. Generally, wakeword detection is completely separate from the actual speech-to-text system. While we don't actually know how wakeword detection works in famous commercial systems (Google Assistant, Siri), there are two likely differences from regular speech-to-text.

  1. Firstly, the states in a wakeword model (i.e. the 'units' of the wakeword the system wants to recognise) are likely trained specifically for that wakeword alone. The 'l' in 'ok Google' only needs to apply to the pronunciation in that exact context (following a g). Pronunciation of sounds is influenced by preceding and following sounds, and general speech-to-text models need to account for possible variants, which increases coverage, but also unreliability (which is what you want to avoid in wakeword detection).

  2. There is likely some timing enforcement applied to the wakeword as well. There needs to be a specific rhythm to your wakeword pronunciation, else the speech recognition won't activate. This can be enforced by limiting how much a user can deviate in their wakeword pronunciation compared to a preset template. General speech to text doesn't have this requirement.

Also, like others have mentioned, wakeword detection runs locally on the device and likely uses a very efficient task-specific recogniser. Only after this recogniser is activated does the data go to much more powerful Google servers (general speech recognition is a much more computationally intensive task).

This is a great answer. Do you work in NLP?

Thanks. I'm in an advanced master programme for AI with a focus on language applications. I just paid attention in my Speech Recognition course :-).

No. A wakeword system is generally a binary acoustic model that uses MFCCs over time as features. It doesn't transcribe any text like ASR systems do, which is why a wakeword system is much lighter than a full ASR. Companies and individuals alike prefer to use a light acoustic binary classification system to spot the wakeword over having all words transcribed. In addition, wakeword systems can be much more robust than the average ASR system with respect to noise.
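"MFCCs over time" means the audio is sliced into short overlapping windows, each of which then gets reduced to a small feature vector. Just the windowing step can be sketched with numpy; the window and hop sizes below are typical textbook values, not necessarily what Precise or SpeechPy use:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Slice a 1-D audio signal into overlapping frames:
    400 samples / 160-sample hop ~ 25 ms windows with a 10 ms
    stride at 16 kHz. Each frame would then be reduced to an
    MFCC feature vector and fed to the binary classifier."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

audio = np.zeros(16000)            # one second of silence at 16 kHz
frames = frame_signal(audio)
# frames.shape == (98, 400): ~98 feature vectors per second of audio
```

The classifier therefore sees a small matrix of features per second rather than raw audio, which is a big part of why wakeword detection is cheap enough to run continuously on-device.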

That's very cool. What are the actual specs of the code you created? I.e., how computationally expensive is this to run, and what are the approximate false positive and false negative rates?

Do you mean how computationally expensive is the model maker?

It's very expensive to run because it trains a model; it also extracts data sets (i.e. converts a lot of files from mp3 to wav) and does a lot of other stuff (see the README for details). If you run Larynx TTS locally, that also adds up. If you are using the Common Voice data set to train the model (which is highly recommended), it will take hours on a CPU. But once the model is trained, inference is the same as for Precise. Well, much less if you use the TensorFlow Lite version. We are also working on models that are further compressed, but there is always an important balance between compression and quality. We will see if that bears fruit in a future release.

As for the quality, it of course varies wildly depending on the data you are using, environmental features (i.e. microphone, settings, noise), the sensitivity settings you choose in the runner, and the wakeword itself. I personally aim for no more than 4 false positives per week in my living room (currently I seem to get only 1 or 2 per week). I don't actually remember the last time I had a false negative, even in a very noisy environment or from a large distance away from the mic.

As these aren't general models, but specific models made from the data you collect for whatever wakeword you want, the measures can't be absolutely generalized. There are too many factors at play here. But I can say that it will work better than any other FOSS solution I have found, or than doing it all manually with Precise for the same number of recordings.

The general quality criteria can be found in the README for the model maker as well as the data sets that are recommended and other helpful information. If you want to go into more detail on testing and results there is the wakeword project wiki.

I would love to hear back from the community as to the quality they are getting from their data. That would be very interesting.

Hope that is helpful!