Posted on Jun 2, 2022

The awesome speech recognition toolkit: Vosk!

What is Vosk?

Vosk is a speech recognition toolkit supporting over 20 languages. Its language models are about 50 MB, light and easy to embed, so you can easily do speech recognition completely offline.

Vosk provides bindings for Python, Java, C#, and also Node.js!

  • Supports 20+ languages and dialects
  • Works offline, even on lightweight devices - Raspberry Pi, Android, iOS

See Vosk's page for details.

Install Vosk

Now you can try Vosk with Python! Vosk can be installed with pip. However, I prefer Poetry, so I'll install it that way.

⚠ Poetry will try to install the latest version (0.3.38), but that version is not compatible with macOS, so I installed it by pinning the version with pip instead. (as of 2022-05-19)
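For reference, a minimal install sketch (the pinned version number is not given in the original post, so treat `<version>` below as a placeholder for whichever release worked on macOS at the time):

```bash
# Poetry would normally pull the latest release, which was broken on macOS,
# so pin an older release explicitly with pip instead.
pip install "vosk==<version>"
```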

And you can download the Python example module from the Vosk examples.

Download the language model

The language models are available here. Extract the zip file and place the extracted folder where your script can find it.

Prepare an audio file

You will need an audio file in the correct format: PCM, 16 kHz, 16-bit, mono.

If you are an English speaker, you can get a test voice file from the Vosk examples.

You can convert the audio with ffmpeg.
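For example, a typical ffmpeg invocation for this conversion might look like the following (file names are placeholders):

```bash
# Resample to 16 kHz, mix down to mono, and encode as 16-bit PCM
ffmpeg -i voice.mp3 -ar 16000 -ac 1 -acodec pcm_s16le test.wav
```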

Run the Python module...
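The example script itself is not reproduced in this copy, but a minimal sketch of what it does looks roughly like this (the model folder name and WAV file name are assumptions):

```python
import json
import sys
import wave

from vosk import KaldiRecognizer, Model

wav_path = sys.argv[1] if len(sys.argv) > 1 else "test.wav"

wf = wave.open(wav_path, "rb")            # must be PCM 16 kHz, 16-bit, mono
model = Model("model")                    # folder containing the extracted language model
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])

print(json.loads(rec.FinalResult())["text"])
```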

It worked!! 🎉 There are some differences from the original text, but Vosk even recognized the Japanese kanji characters. 🀄

I'm a Japanese speaker, so I recognized a Japanese audio file. The text of the audio is "ă”èŠ–èŽă‚ă‚ŠăŒăšă†ă”ă–ă„ăŸă—ăŸïŒă‚°ăƒƒăƒ‰ăƒœă‚żăƒłăšăƒăƒŁăƒłăƒăƒ«ç™»éŒČă‚ˆă‚ă—ăăŠéĄ˜ă„ă—ăŸă™ïŒ".

The complete set of commands is below.
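The original command listing is not preserved here; roughly, the whole sequence was something like the following (the model name, file names, and script name are assumptions, not the author's exact values):

```bash
# 1. Install Vosk (pinned version, see the note above)
pip install "vosk==<version>"

# 2. Download a model from https://alphacephei.com/vosk/models and extract it
unzip vosk-model-small-ja-0.22.zip && mv vosk-model-small-ja-0.22 model

# 3. Convert the recording to PCM 16 kHz, 16-bit, mono
ffmpeg -i voice.mp3 -ar 16000 -ac 1 -acodec pcm_s16le test.wav

# 4. Run the example script on it
python3 test_simple.py test.wav
```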

The code is on GitHub and Replit. I hope you'll enjoy Vosk too! Thank you.

kama-meshi / HelloVosk

Sample Vosk repl with Python.

This is a sample repl for Vosk with Python.

Sample voice

Let's recognize this voice đŸŽ€

"ă”èŠ–èŽă‚ă‚ŠăŒăšă†ă”ă–ă„ăŸă—ăŸïŒă‚°ăƒƒăƒ‰ăƒœă‚żăƒłăšăƒăƒŁăƒłăƒăƒ«ç™»éŒČă‚ˆă‚ă—ăăŠéĄ˜ă„ă—ăŸă™ïŒ"

And my repl is on Replit.

https://replit.com/@kama-meshi/HelloVosk

Special Thanks

  • Voice: こえやさん


No more Sphinx: Offline Speech Recognition with Vosk


CMU Sphinx is no more. Is it all over? Not at all: welcome Vosk. So let's build something with Vosk and our trusty NLTK, shall we?


The long-lived and long-loved CMU Sphinx, a brainchild of Carnegie Mellon University, has not been actively maintained for about five years. But does that mean we need to move to more production-oriented solutions? No, we actually don't. The CMU Sphinx team has slowly rolled out a new child project: Vosk.

Note that there are many other production-oriented solutions available (like OpenVINO, Mozilla DeepSpeech, etc.), which are just as good, if not better, at speech recognition. Here I am focusing on ease of setup and use. 😅

Okay, I don't know what you are talking about. Please explain more.

Quoting the Official CMU Sphinx wiki's About section (forgive me for being lazy):

CMUSphinx Wiki About Section

I get it, but why do you call this dead?

This is the screenshot of the two most recent posts on the CMU Sphinx Official Blog :

CMUSphinx blog

Also this discussion from YCombinator:

YCombinator thread

Even if I disagree with the YCombinator discussion, the official CMU Sphinx blog does little to give me confidence.

Okay I get it. So what now?

Another screenshot from the main CMU Sphinx website :

CMUSphinx page head

Not gonna lie, I was pretty disappointed 🙁. I've been a Sphinx user for quite some time. I'm no researcher, but I was actually familiar with Sphinx. So I wondered how Vosk would work for me, and I was really surprised at how gentle the learning curve was for adding Vosk to my apps. But there is very little documentation at the time of writing this post. I hope this post will fill some of that gap.

Anyways, enough chatter. In this post, I am going to show you how to set up a simple Python script to recognize your speech, using Vosk alongside NLTK to identify your speech and extract the keywords. The end result? A fully functional system that takes your voice input and processes it reasonably accurately, so that you can add voice control features to any awesome projects you may be building! 😃

Setting up:

Stage 0: Resolving system-level dependencies

"Know thy tools." ~ Some great person.

Okay, so before we start, let's see what we'll be working with:

  • A Linux System (Ubuntu in my case). Windows and Mac users, don't be disheartened - the programming part is the same for all.
  • PulseAudio Audio Drivers
  • Python 3.8 with pip working.
  • A working Internet connection
  • An IDE (preferably) (VSCode in my case)
  • A microphone (or a headphone or earphone with an attached microphone)

So first, we need to install the appropriate pulseaudio , alsa and jack drivers, among others.

Assuming you're running Debian (or Ubuntu), type the following commands:
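The exact commands are not preserved in this copy; assuming the packages named in this post (PulseAudio, ALSA, JACK and swig), they were along these lines:

```bash
# First, swig - libasound2-dev and jackd need it to build their driver code
sudo apt-get install -y swig

# Then the audio stack itself
sudo apt-get install -y pulseaudio libpulse-dev libasound2-dev jackd
```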

Note : Don't try to combine the above 2 statements (no pro-gamer move now 😜). libasound2-dev and jackd require swig to build their driver codes.

If you're familiar with CMU Sphinx, you'd realise that there are a lot of common dependencies - which is no coincidence. Vosk comes from Sphinx itself.

If you face some issues with installing swig , don't worry. Just Google your error with the keyword CMU Sphinx.

Stage 1: Setting up Vosk-API

First, we need to download Vosk-API. The Vosk API requires much less setup than building from the original source code.

Assuming you have git installed on your system, enter in your terminal:
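The repository URL below is the standard alphacep location (also linked elsewhere on this page), so the command was presumably:

```bash
git clone https://github.com/alphacep/vosk-api.git
```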

If you don't have git, or have some other issues with it, download Vosk-API from here .

Create a project folder (say speech2command). Download (or clone) the Vosk-api code into a subfolder there.

Now extract the .zip file (or .tar.gz file) into your project folder (if you downloaded the source code as an archive).

Your directory structure should look something like this:

Now we're good to go.

Stage 2: Setting up a language model

The versatility of Vosk (or CMUSphinx) comes from its ability to use models to recognize various languages.

Simply put, models are the parts of Vosk that are language-specific and enable speech recognition in different languages. At the time of writing, Vosk supports more than 18 languages, including Greek, Turkish, Chinese, Indian English, etc.

The Vosk Model Wiki

In this post, we are going to use the small American English model. It's compact (around 40 MB) and reasonably accurate.

Download the model and extract it in your project folder. Rename the folder you extracted from the .zip file as model . Now, your directory structure should look like this:
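Assuming the project folder from Stage 1, the layout at this point would be roughly:

```
speech2command/
├── vosk-api/
└── model/
```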

Here is a video walkthrough (albeit a bit old):

Stage 3: Setting up Python Packages

For our project, we need the following Python packages:

  • Speech Recognition

The platform, sys and json packages come included in a standard Python 3 installation. We need to install the other packages manually.

In the command line, type:
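The exact command is not preserved; given the packages this post actually uses (Vosk and NLTK), it was presumably something like:

```bash
pip3 install vosk nltk
```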

Wait as the components get installed one by one.

Stage 4: Setting up NLTK Packages

Now, NLTK is a huge package, with a dedicated index to manage its components. So far we have only downloaded the NLTK core, enough to get a basic program up and running. We need a few more NLTK components to continue with the code.

The required packages are: stopwords , averaged_perceptron_tagger , punkt , and wordnet .
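In a Python shell, the listed components can be fetched one at a time:

```python
import nltk

nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
nltk.download("punkt")
nltk.download("wordnet")
```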

Or in one line:
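For example, using NLTK's command-line downloader:

```bash
python3 -m nltk.downloader stopwords averaged_perceptron_tagger punkt wordnet
```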

Stage 5: Programming with Vosk and NLTK

Here comes the fun part! Let's code something in Python to identify speech and convert it to text, using Vosk-API as the backend.

Make a new Python file (say s2c.py) in your project folder. Now the project directory structure should look like this:
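With the script added, something like:

```
speech2command/
├── vosk-api/
├── model/
└── s2c.py
```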

Coding time now! đŸ€©

Okay, so the code for the project is given below. The code is pretty clean (or so I hope), and you can understand the code yourself (or just copy-paste it 😜).
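That original listing is not reproduced in this copy. A minimal sketch of the same idea - stream microphone audio into Vosk, then push the recognized text through NLTK to pull out keywords - could look like this (it assumes the sounddevice package for microphone capture, which may differ from the author's original code):

```python
import json
import queue

import sounddevice as sd
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from vosk import KaldiRecognizer, Model

SAMPLE_RATE = 16000
audio_q = queue.Queue()


def capture(indata, frames, time, status):
    """Push raw microphone blocks onto a queue for the recognizer."""
    audio_q.put(bytes(indata))


def keywords(text):
    """Drop stopwords and keep nouns/verbs as rough command keywords."""
    words = [w for w in word_tokenize(text) if w.isalpha()]
    words = [w for w in words if w.lower() not in stopwords.words("english")]
    return [w for w, tag in pos_tag(words) if tag.startswith(("NN", "VB"))]


model = Model("model")                    # the folder set up in Stage 2
rec = KaldiRecognizer(model, SAMPLE_RATE)

with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=8000, dtype="int16",
                       channels=1, callback=capture):
    print("Listening... press Ctrl+C to stop.")
    while True:
        data = audio_q.get()
        if rec.AcceptWaveform(data):
            text = json.loads(rec.Result()).get("text", "")
            if text:
                print("heard   :", text)
                print("keywords:", keywords(text))
```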

Now run this code, and this will set up a listener that works continuously - with some verbose logs as well - which you can see on your terminal screen. Ignore those logs, they are just for information.

If you need the source code, I made a repo for it: Vosk Demo

Explanation:

Here is a flowchart that shows exactly how this works:

Project mechanism flowchart

So this was it, folks! Enjoy your very own speech2text (or rather, speech2command) recognition system. Keep tinkering!

Bye for now! Ron



candideu / Open Source AI Scribe, Auto-Transcriber, Speech-to-text Transcriptions, Captions & Subtitles Exporter, Interactive Transcripts, Alternative to Otter.ai, Descript, Sonix.ai.md


Hello world!

As a video editor, researcher, digital media enthusiast, and lover of all things FLOSS, I've been on the hunt for an open source alternative to proprietary services like Otter.ai, Sonix, and Descript. I've pitched my idea on open-source-ideas, but I wanted to create a dedicated post for it so that it can reach as many people as possible.

Project description

A simple, easy-to-use application where users can dictate or upload audio or video files, and an automated transcript is generated. This transcript is synced to the audio track , clickable , and editable , so that users can skip to certain passages and refine the transcript accordingly.

The revised transcript can then be exported as plain text, .srt caption file (and other subtitle formats), .pdf, shareable web page, etc. for further processing.

Users can also provide their own language models, so that the number of possible languages that can be transcribed grows over time, as people create new models.

This application could be something you access from a browser that uses local storage, or a downloadable app (using something like Electron).

Inspiration, and the "Why"

As someone who works a lot with video and audio, and aims to make my work accessible, I'm a big fan of Otter.ai and Sonix.ai . They're very easy to use and provide pretty accurate transcriptions.


Issues, and what's missing in existing tools

That said, Otter and Sonix are not open-source, and their free tiers can be limiting. Both Otter and Sonix offer three lifetime uploads max, and Otter allows 40 minutes of live transcriptions per recording, with a max of 600 minutes a month (no rollover).

Otter only does transcriptions in English. Sonix does offer 37+ languages, but it doesn't look like you can provide your own language models. Other options like YouTube's automated transcriptions offer a wider range of languages, but that involves having to upload the media to YouTube, and there's no clickable transcript option.

Another issue is that some folks use automated transcriptions in their line of work, but cannot use cloud-based, proprietary software for legal reasons ( see this Reddit thread ).

Relevant Technology

I am in no way an expert, but it seems like Python would be relevant. That said, I'm open to any ideas, and open to having this be an application that's downloaded on your computer (with cross-platform support), or a web application that uses local storage, etc.

Speech-to-text

Vosk-browser

VOSK Browser is a speech recognition library running in the browser thanks to a WebAssembly build of Vosk. This implementation is probably the one I'm the most excited about because it's very close to what I had in mind. The demo they've created allows you to use your microphone or to upload an audio file to create the transcription. The cool thing about this approach is that you don't need to set up any loopback methods if you are using pre-recorded audio, because the demo seems to do it on its own.

According to the dev , "This project aims just to be a library that wraps a wasm build of vosk and the demo is just a demo of what can be done so I won't be adding such functionalities to the library itself. I have thought of integrating transcription with vosk-browser to oTranscribe which I guess would achieve what you want. I currently have no time for that but maybe someone can pick this up, would be really cool."

Potential ways to build upon this project:

  • Adding punctuation: I've found a number of punctuation restoration projects on here that could help with that such as punctuator2 and its many forks such as PunkProse . Punctuator2 even has a nifty demo which you can try out here . I also found an implementation of PunkProse + VOSK here .
  • Making the transcript editable
  • Adding timings that are synced to the audio (I assume that the live dictation would have to be recorded)
  • The ability to export the work as a subtitle/caption file

Check out the Demo: https://ccoreilly.github.io/vosk-browser/ View GitHub Repo: https://github.com/ccoreilly/vosk-browser

ideasman42/nerd-dictation

Uses the VOSK API, but is meant for Linux and is installed and used from the command line. It also doesn't have a clickable transcript.

Source code can be viewed here

saharmor/realtime-transcription-playground

Very similar to what I'm proposing, but uses Google's Speech API, which involves creating a service account and knowing how to use their Cloud Console.


Clickable, Interactive Transcript

Able Player is a fully accessible, open-source cross-browser HTML5 media player. It's not a text-to-speech API, but the player has a really neat clickable transcript feature that can be seen in the following example:

The source code can be viewed here .

Subtitle + Transcript Editors + Previewers

oTranscribe

oTranscribe is one of the more well-known options in this space. It's a tool for manually transcribing audio interviews that allows you to import a video or audio file, and manually type the transcript. You can also add timestamps which can be clicked on to jump to that point in the audio/video. oTranscribe also features great keyboard shortcuts and playback tools to ease the transcription process.


There's even an oTranscribe for Electron fork that could be interesting to look into.

  • No speech-to-text
  • Cannot export to .srt (although an .otr to .srt conversion is possible with this external tool)
  • Cannot edit timestamps as text

View the website here: https://otranscribe.com/ View the repo here: https://github.com/oTranscribe

Hyperaudio seems to be working on an exciting suite of open interactive transcript tools which allow people to Navigate, Search and Edit transcripts!

I namely want to highlight the following tools, which could be of interest:

Hyperaudio Lite Editor: A lightweight transcript editor for editing and correcting STT generated timed transcripts

  • Repo: https://github.com/hyperaudio/hyperaudio-lite-editor


  • Repo: https://github.com/hyperaudio/hyperaudio-lite


  • Site: https://hyperaud.io/converter/converter.html
  • Repo: https://github.com/hyperaudio/ha-converter

Hyperaudio Website for now: https://lab.hyperaud.io/ Official Website: https://hyper.audio/

All-rounders

Kdenlive, the open-source video editor, introduced a speech-to-text module in version 21.04 using VOSK, an offline speech-recognition API. That said, the feature is still pretty new and kind of buggy. It also involves having to download Python and knowing how to use Kdenlive. I like the idea of using VOSK's API, but I think having a simple, dedicated application that works out of the box for automated transcriptions would be best, especially for people who aren't tech-savvy.


View their source code here: https://invent.kde.org/multimedia/kdenlive/-/tree/master/data/scripts

Video Transcriber

Video Transcriber is a computer-assisted video/audio transcription tool which, from what I can gather, seems to be what I have in mind. It's a prototype made with journalists and media professionals in mind.

Unfortunately, the demo link I found seems to be broken, so I haven't been able to test this one out. Testing this project otherwise would involve installing dependencies and creating an IBM Bluemix Account (which has monthly limits). The implementation I had in mind would be easy for non-technical users to use out-of-the-box.


View the repo: https://github.com/glitchdigital/video-transcriber

Complexity and required time

I'm not the most knowledgeable on these frameworks, so please let me know if I should tick other options for the complexity. That said, I'm open to helping with the design of the user interface.

  • Beginner - This project requires no or little prior knowledge of the technolog(y|ies) specified to contribute to the project
  • Intermediate - The user should have some prior knowledge of the technolog(y|ies) to the point where they know how to use it, but not necessarily all the nooks and crannies of the technology
  • Advanced - The project requires the user to have a good understanding of all components of the project to contribute

Required time (ETA)

  • Little work - A couple of days
  • Medium work - A week or two
  • Much work - The project will take more than a couple of weeks and serious planning is required
  • Frontend/UI
  • APIs/Backend
  • Voice Assistant
  • Developer Tooling
  • Extension/Plugin/Add-On
  • Futuristic Tech/Something Unique

My own programming (?) skills are limited to HTML, basic CSS, and the tiniest bit of JavaScript. As such, I'm hoping to share my findings and proposed idea here in the hopes that more competent coders can bring this to life.

vosk-api VS vosk

Compare vosk-api vs vosk and see what their differences are.


  • Infini-Gram: Scaling unbounded n-gram language models to a trillion tokens 4 projects | news.ycombinator.com | 5 May 2024
  • VOSK Offline Speech Recognition API 1 project | news.ycombinator.com | 13 Apr 2024
  • Apollo dev posts backend code to Git to disprove Reddit’s claims of scrapping and inefficiency 4 projects | /r/webdev | 9 Jun 2023
  • Working Vosk model? 1 project | /r/learnpython | 29 May 2023
So I don't know if my issue comes from my lack of knowledge of discord.js/voice or VOSK. so I guess the most important thing I need to see is if I am creating a proper stream for the Vosk API to capture the audio. if I can figure out how to capture an audio stream I can probably import that in to vosk and figure out how to use vosk myself. but right now I can't even get close! Thank you in advance...Sorry if this isn't the right place for this
I remember a while ago checking out the issues with Vosk speech recognition (written in C). A handful of it's issues are related to segfaults and null pointers.
first, good initiative! thanks for sharing. i think you gotta be more diligent and careful with the problem statement. checking the weather in Sofia, Bulgaria requires cloud, current information. it's not "random speech". ESP SR capability issues don't mean that you cannot process it locally. the comment was on "voice processing" i.e. sending speech to the cloud, not sending a call request to get the weather information. besides, local intent detection, beyond 400 commands, there are great local STT options, working better than most cloud STTs for "random speech" https://github.com/alphacep/vosk-api
I did a one-off text to speech tool for someone last year and had pretty good results with VOSK. One upside is that it works offline, although I imagine if you use TTS a lot you'll notice issues I didn't.
You can use vosk-api (https://github.com/alphacep/vosk-api) to listen to your audio, transform it to text, and then post the text to GPT-3, then using the vector sdk, have your responses said by vector.
The set up script wants to download https://github.com/alphacep/vosk-api/releases/download/v0.3.45/vosk-model-en-v0.3.45.zip, but this resource is not found. AFAICT all releases never contained a model file. Remedy: hardcode one model from https://alphacephei.com/vosk/models. I guessed and picked the one with the closest name, vosk-model-en-us-0.22.zip, just so I could continue.
  • Speech-to-text software recommendations? 1 project | /r/software | 7 Jan 2021
Please consider integration with Vosk https://github.com/alphacep/vosk for offline speech recognition. It should be a good fit.

What are some alternatives?

whisper - Robust Speech Recognition via Large-Scale Weak Supervision

pocketsphinx - A small speech recognizer

Kaldi Speech Recognition Toolkit - kaldi-asr/kaldi is the official location of the Kaldi project.

AnySoftKeyboard - Android (f/w 2.1+) on screen keyboard for multiple languages.

vosk-server - WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries

simple-keyboard

TTS - 🐾💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

AutoSub - A CLI script to generate subtitle files (SRT/VTT/TXT) for any video using either DeepSpeech or Coqui

OpenBoard - 100% foss keyboard based on AOSP, with no dependency on Google binaries, that respects your privacy.

DeepSpeech - Install Mozilla DeepSpeech on a Raspberry Pi 4

hackerskeyboard - Hacker's Keyboard (official)

Vosk Speech Recognition Toolkit NuGet Package

Vosk is an offline open source speech recognition toolkit. It enables speech recognition models for 20+ languages and dialects - English, Indian English, German, French, Spanish, Portuguese, Chinese, Russian, Turkish, Vietnamese, Italian, Dutch, Catalan, Arabic, Greek, Farsi, Filipino, Ukrainian, Kazakh, Swedish, Japanese, Esperanto, Hindi, Czech. More to come. Vosk models are small (50 Mb) but provide continuous large vocabulary transcription, zero-latency response with streaming API, reconfigurable vocabulary and speaker identification. Speech recognition bindings implemented for various programming languages like Python, Java, Node.JS, C#, C++ and others. Vosk supplies speech recognition for chatbots, smart home appliances, virtual assistants. It can also create subtitles for movies, transcription for lectures and interviews. Vosk scales from small devices like Raspberry Pi or Android smartphone to big clusters.


Digitale SouverÀnitÀt

Living sovereign and self-determined in the digital age

Open Source Speech Recognition with Vosk

Have you ever thought about simply having your audio files transcribed automatically? Then, like me, you have probably sooner or later stumbled across the keyword "speech to text". "Speech to text" essentially means that an audio file is fed into a piece of software, which outputs a finished text file at the end. With Vosk and the GitHub repository "recasepunc" I have since found an open source solution for transcription, including correction of capitalization and punctuation. Building on the groundwork of John Singer, I was also able to build a user interface for the text recognition and publish it on my Codeberg account. To use it you need a Python environment and some experience with it, you have to install a few packages (including PyTorch, which is about 700 MB), and you have to download the trained models. More on that further below.

My motivation

Doctors and lawyers know the problem: their work requires producing a great many documents, and they don't want to type everything. Bloggers and journalists also write a lot and accordingly need a way to simplify the process. In general, you speak faster than you write anyway. But the spoken word then has to be turned into text somehow, so that it can be sent off or used for other things. Often that means that a typist, or you yourself, spends hours listening to your own monologues and typing them out.

If, like me, you have nobody who will take this work off your hands (and don't want to hire anybody), things get a bit more complicated. Just a few years ago, the first audio recognition programs were still fairly rudimentary. They recognized something, just usually not what you had actually said. That has of course improved a lot in the meantime; after all, many people know Alexa or Siri or other voice-command bots and services. In other words, it works quite well nowadays. Microsoft also ships its own speech recognition for Windows, which can easily be controlled with a microphone and writes directly into Word.

I know there are now also small apps for phones that can handle speech recognition for you. What all speech recognition services have in common, of course, is the question of how well specific technical terms are recognized. Physicians in particular have the problem that their texts are full of technical terms which, naturally, also have to be recognized.

Why open source?

That is not a problem I have, but I do have a problem with my audio files being sent over the internet into some cloud - which is what happens with most of these services - and then being analyzed by some company, without me knowing exactly what happens to my data. Now, you can buy software for your computer or your mobile phone that performs the recognition offline.

As a Linux and Sailfish user, however, that is again a bit harder, since native software for these operating systems usually doesn't exist. Sure, you can set up a virtual machine and use offline software inside it. But if it can be avoided, I do avoid it and hope - as with other software - for an open source solution that can be used on Linux. On top of that, I had long wanted to dictate my blog posts instead of typing everything. After some research I have now found something that combines all of these wishes: the open source project Vosk, which can be used through a Python interface.

What is actually behind speech recognition?

Behind speech recognition sits what is called natural language processing (NLP). An algorithm is trained on as many audio files as possible to map the different frequencies in an audio file to words. The machine learning algorithm is fed a wide variety of audio files together with the correct transcription for each recording, along with the instruction to minimize wrong assignments, so that the algorithm can train itself over many learning cycles. Put simply: the machine listens to lots of audio files and is also told the correct result, and in this way the model (machine learning) can learn to recognize relationships between frequency differences and letters (or words).

Training such a model continuously takes a great deal of computing power. At the same time, you want to train several models with different settings so that you can pick the best result at the end, which means you need even more computing power and systems working in parallel. The fully trained model, at several gigabytes, is not exactly small either. These are the reasons why most speech recognition services rely on a cloud or a server: the speech you dictated is sent to the machine learning algorithm there, which then returns a text file based on what it has learned.

That means not only is the source file sent away, but companies are also involved to which you might not necessarily want to hand your audio files (Amazon, with AWS, is the largest cloud host). For anyone who owns an Amazon Echo Dot, that is certainly not a problem. For me, who owns no such thing and who would like to keep control over my data - whether text or voice and speech - it is a bit more difficult.

Vosk's open source approach

For Christmas I was given a dictation device, so the speech recognition plans became more concrete. Now that machine learning has advanced so far, not least thanks to progress in computer hardware, the problems described above have become easier for open source enthusiasts to handle. And with Vosk, speech recognition has been made accessible to others through an interface. In painstaking detail work, volunteers have trained various models. The finished models can be used however you like: offline on your phone or PC, or to equip your own server that apps can then talk to. If you want, you can even follow the Vosk developers' documentation and train your very own model with the code and your own audio files.

With a finished model, a text prediction can then be made for any audio file. That all sounds a bit complicated. It is not entirely trivial, but it is also not so hard that you cannot understand it. The first thing you need is a model that has been trained on audio files of the desired language. Based on frequency differences it can then distinguish different voices and words and map them to text. For this, Vosk provides the pre-trained models plus interfaces in various programming languages through which a model can be driven. As mentioned, it is an open source project, and there is an interface for Python. Since developing KYSA, Python has been my programming language for practically everything I write.

The nice advantage of all this is, of course, that it is, first, open source and, second, works offline. Once I have downloaded the model to my PC, I no longer need an internet connection and my data stays on my computer. That is a very important point for me. The open source approach also ensures that everyone can use and modify this work. Of course, as always with open source projects, there are downsides too: given the limited financial and time resources, the results are not quite as polished and accurate as those of a specialized program from a company with hundreds of employees working on it. In the end, everyone has to weigh that up for themselves.

How can Vosk be used?

I definitely wanted to try it, so I looked more closely at this Vosk interface. Through that, and through some internet research, I came across John Singer's blog. He apparently had the same goal as I did and was already one step further: he had written a small program with a user interface for this Vosk interface. There you can plug in the downloaded model and use it to transcribe an audio file (*.mp3 or *.wav). As a result you get a JSON file with the recognized words. Since MP3 files are compressed audio, the ffmpeg package must also be installed to recognize them. How to install ffmpeg and hook it up to the Python code is explained by John on this page.
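To give an idea of what such a transcription looks like under the hood, here is a minimal sketch (not John Singer's actual program; the file and model paths are placeholders) that reads a 16 kHz mono WAV file with Vosk and writes the raw recognized text next to it:

```python
import json
import wave
from pathlib import Path

from vosk import KaldiRecognizer, Model

AUDIO = Path("recording.wav")        # placeholder: a PCM 16 kHz mono WAV file
model = Model("vosk-model-de")       # placeholder: folder of the downloaded German model

wf = wave.open(str(AUDIO), "rb")
rec = KaldiRecognizer(model, wf.getframerate())

pieces = []
while True:
    data = wf.readframes(4000)
    if not data:
        break
    if rec.AcceptWaveform(data):
        pieces.append(json.loads(rec.Result())["text"])
pieces.append(json.loads(rec.FinalResult())["text"])

# One continuous, unpunctuated lowercase string - exactly what the
# punctuation/capitalization model is fed in the next step.
AUDIO.with_suffix(".txt").write_text(" ".join(p for p in pieces if p))
```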

I tried this too, and it works quite well. It is astonishing how many words were recognized correctly, even though the model knew neither my voice nor my way of speaking; it had instead been fed and trained by others with the most varied audio snippets.

Punctuation and capitalization

Of course, I didn't want a JSON file of individual words but continuous text, ideally with correct punctuation and capitalization. So I adapted the code to produce a *.txt file as the result and kept researching how I could additionally have the punctuation and capitalization predicted. That is how I found out that there is a second model. It processes unformatted text character by character to correct the punctuation and the capitalization. This model, too, was programmed and trained by volunteers; it is largely based on the work of Benoit Favre, who offers the code and results on his GitHub. Especially in German it is quite difficult to work out whether a period, comma or question mark fits a sentence, without it being dictated. Capitalization makes it harder still, since it occurs much more frequently in German than in English. So this model has to be trained on a very large number of sentences to get a satisfying result. And here, too, I was surprised at how good the results of this model are. Altogether, training both models was of course a great deal of work, and I am very grateful to the people who did it.

My extension for using both models

So I combined the individual building blocks of my complete text recognition pipeline and tied them together in a further development of John Singer's small program. You can find this code on my Codeberg profile. First, the selected audio file is recognized as text with the Vosk model, formatted into one continuous text variable (string), and fed into the punctuation model based on Benoit Favre's work. That model then corrects the punctuation and capitalization in the text variable, and at the end a finished *.txt file is written to the location of the audio file. It can then be opened and corrected in any word processor.

So that the whole thing also works with different languages and models, the program lets you plug in the desired model separately, both for the audio recognition and for the punctuation/capitalization correction. Depending on your needs, you can switch models and see which ones deliver better recognition. The prerequisite, of course, is that both models are stored locally on the PC. And if you only want to try the speech recognition without punctuation and capitalization, you can deselect the second model with a simple checkbox. The end result is then still a text file, just without punctuation and all in lowercase.

With all of this you should of course be aware that neither model produces perfectly correct sentences. Not least, how clearly intelligible, and therefore recognizable, the texts are also depends on your own way of speaking. Last week I dictated my first text for Draußen tut gut and would say it came out about ninety percent correct. The text you are reading right now was also dictated, and overall more than ninety percent was correct (including punctuation and capitalization). The text still has to be revised at the end, but it is nevertheless a tremendous time-saver.

And so that the whole thing can be used not only by me but also by everyone else who is interested, the code and a small guide are available at the Codeberg source mentioned above. A bit of Python knowledge doesn't hurt; for setting up the necessary Python environment, John Singer provides a good step-by-step guide. For everything to work, you have to set up a virtual environment and install the required packages listed in the file "requirements.txt". You also need the ffmpeg package to be able to transcribe MP3 files. Once all of that is in place and you have downloaded the trained models and stored them locally, the rest is very simple via the user interface. Depending on the size of the audio file, the transcription process then only takes a few minutes.

I wish everyone lots of fun transcribing, and of course I am also happy about feedback and your experiences; feel free to leave a comment!

In that spirit: with Vosk, anyone who is interested gets a great open source collection for speech recognition. With the pre-trained, downloadable models in various languages, nothing stands in the way of offline use. In addition, there are further models based on Benoit Favre's work that can be used to restore punctuation and capitalization. Together the two models are about 2.5 GB, so you definitely need that much space. In return, you can transcribe your voice recordings independently of the internet and any cloud providers. With the small program that I provide on my Codeberg account, based on John Singer's groundwork, the actual transcription process is started with just a few mouse clicks. At the end you get a finished *.txt file with continuous text and fairly accurate punctuation.

11 comments

Hello Daniel, it's just great that you are putting so much effort into this! I had also had my eye on Vosk, for the same reasons as you, and I'm glad that the speech recognition work is now moving along so quickly.

Hey Daniel, I'm using it on Windows. With the John Singer version ĂŒ, Ă€ and ö are not recognized, which is why I use yours. I think it is because of the capitalization, or do you have a specially trained language?

Could you not integrate PyTorch and create and upload an exe? That would be nice, thanks.

Hello egger, yes, that is probably due to the second model. I don't use Windows, and an exe is not planned; it would be too big anyway. But you can build your own with pyinstaller. As for the process error, I think it is due to the different process handling of the WAV conversion on Linux and Windows. Maybe I'll have time next week to look into what might be causing it.

Hey, thanks for taking the time to look at it. Yes, I created an exe file with pyinstaller, but the exe won't run and I get an error message: file already exists but should not

Hi, it took a while, but I have not been able to find out where it goes wrong on Windows. Instead, I have extended the program so that you can select several audio files and process them one after the other. If you like, you can give it a try.

Hey, when I convert a wav file I get an error message: Windows: Object has no attribute 'wavStereoFile'. With an mp3 it works fine, thanks.

Hello, thanks for the hint. There was a small bug in the WAV file conversion. I have fixed it and updated it on Codeberg. It should work now. I hope it does, and have fun with it!

Hey, now I get a different error message: The process cannot access the file because it is being used by another process

Also, when I convert the py to an exe, it won't launch. Could you look at that too? Thanks.

Hello, I cannot reproduce the error; everything works for me. What system are you using it on? An exe cannot be made, because the PyTorch library would have to be bundled, and that alone is 400 MB. You can try John Singer's original version, which is also available as an exe.


speech_to_text package


A library that exposes device specific speech recognition capability.

This plugin contains a set of classes that make it easy to use the speech recognition capabilities of the underlying platform in Flutter. It supports Android, iOS and web. The target use cases for this library are commands and short phrases, not continuous spoken conversation or always-on listening.

Platform Support

build: means you can build and run with the plugin on that platform

speech: means most speech recognition features work. Platforms with build but not speech report false for initialize

* Only some browsers are supported, see here

Recent Updates

6.6.0 listen now uses 'SpeechListenOptions' to specify the options for the current listen session, including new options for controlling haptics and punctuation during recognition on iOS.

6.5.0 New initialize option to improve support for some mobile browsers, SpeechToText.webDoNotAggregate . Test the browser user agent to see if it should be used.

Note : Feedback from any test devices is welcome.

To recognize text from the microphone import the package and call the plugin, like so:

Complete Flutter example

Example apps.

In the example directory you'll find a few different example apps that demonstrate how to use the plugin.

Basic example ( example/lib/main.dart )

This shows how to initialize and use the plugin and allows many of the options to be set through a simple UI. This is probably the first example to look at to understand how to use the plugin.

Provide example ( example/lib/provider_example.dart )

If you are using the Provider package (https://pub.dev/packages/provider) in Flutter then this example shows how to use the plugin as a provider through the SpeechToTextProvider class.

Plugin stress test ( example/lib/stress.dart )

The plugin opens and closes several platform resources as it is used. To help ensure that the plugin does not leak resources this stress test loops through various operations to make it easier to track resource usage. This is mostly an internal development tool so not as useful for reference purposes.

Audio player interaction ( examples/audio_player_interaction/lib/main.dart )

A common use case is to have this plugin and an audio playback plugin working together. This example shows one way to make them work well together. You can find it in examples/audio_player_interaction/lib/main.dart.

Initialize once

The initialize method only needs to be called once per application session. After that listen , start , stop , and cancel can be used to interact with the plugin. Subsequent calls to initialize are ignored which is safe but does mean that the onStatus and onError callbacks cannot be reset after the first call to initialize . For that reason there should be only one instance of the plugin per application. The SpeechToTextProvider is one way to create a single instance and easily reuse it in multiple widgets.

Permissions

Applications using this plugin require user permissions.

Add the following keys to your Info.plist file, located in <project root>/ios/Runner/Info.plist :

  • NSSpeechRecognitionUsageDescription - describe why your app uses speech recognition. This is called Privacy - Speech Recognition Usage Description in the visual editor.
  • NSMicrophoneUsageDescription - describe why your app needs access to the microphone. This is called Privacy - Microphone Usage Description in the visual editor.

Add the record audio permission to your AndroidManifest.xml file, located in <project root>/android/app/src/main/AndroidManifest.xml .

  • android.permission.RECORD_AUDIO - this permission is required for microphone access.
  • android.permission.INTERNET - this permission is required because speech recognition may use remote services.
  • android.permission.BLUETOOTH - this permission is required because speech recognition can use bluetooth headsets when connected.
  • android.permission.BLUETOOTH_ADMIN - this permission is required because speech recognition can use bluetooth headsets when connected.
  • android.permission.BLUETOOTH_CONNECT - this permission is required because speech recognition can use bluetooth headsets when connected.

Android SDK 30 or later

If you are targeting Android SDK 30 or later, i.e. you set your targetSdkVersion to 30 or later, then you will need to add the following to your AndroidManifest.xml right after the permissions section. See the example app for the complete usage.

Adding Sounds for iOS (optional)

Android automatically plays system sounds when speech listening starts or stops but iOS does not. This plugin supports playing sounds to indicate listening status on iOS if sound files are available as assets in the application. To enable sounds in an application using this plugin add the sound files to the project and reference them in the assets section of the application pubspec.yaml . The location and filenames of the sound files must exactly match what is shown below or they will not be found. The example application for the plugin shows the usage. Note These files should be very short as they delay the start / end of the speech recognizer until the sound playback is complete.

  • speech_to_text_listening.m4r - played when the listen method is called.
  • speech_to_text_cancel.m4r - played when the cancel method is called.
  • speech_to_text_stop.m4r - played when the stop method is called.

Switching Recognition Language

The speech_to_text plugin uses the default locale for the device for speech recognition by default. However it also supports using any language installed on the device. To find the available languages and select a particular language use these properties.

There's a locales property on the SpeechToText instance that provides the list of locales installed on the device as LocaleName instances. Then the listen method takes an optional localeId named param which would be the localeId property of any of the values returned in locales . A call looks like this:

Troubleshooting

Speech recognition not working on the iOS simulator

If speech recognition is not working on your simulator try going to the Settings app in the simulator: Accessibility -> Spoken content -> Voices

From there select any language and any speaker and it should download to the device. After that speech recognition should work on the simulator.

Speech recognition stops after a brief pause on Android

Android speech recognition has a very short timeout when the speaker pauses. The duration seems to vary by device and version of the Android OS. In the devices I've used none have had a pause longer than 5 seconds. Unfortunately there appears to be no way to change that behaviour.

Android beeps on start/stop of speech recognition

This is a feature of the Android OS and there is no supported way to disable it.

Android build

Version 5.2.0 of the plugin and later require at least compileSdkVersion 31 for the Android build. This property can be set in the build.gradle file.

Continuous speech recognition

There have been a number of questions about how to achieve continuous speech recognition using this plugin. Currently the plugin is designed for short intermittent use, like when expecting a response to a question, or issuing a single voice command. Issue #63 is the current home for that discussion. There is not yet a way to achieve this goal using the Android or iOS speech recognition capabilities.

There are at least two separate use cases for continuous speech recognition:

  • voice assistant style, where recognition of a particular phrase triggers an interaction;
  • dictation of text for input.

Voice assistant style interaction is possibly better handled by integrating with the existing assistant capability on the device rather than building out a separate capability. Text dictation is available through the keyboard for standard text input controls though there are other uses of dictation that are not currently well supported.

Browser support for speech recognition

Web browsers vary in their level of support for speech recognition. This issue has some details. The best lists I've seen are https://caniuse.com/speech-recognition and https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition . In particular in issue #239 it was reported that Brave Browser and Firefox for Linux do not support speech recognition.

Speech recognition from recorded audio

There have been a number of questions about whether speech can be recognized from recorded audio. The short answer is that this may be possible on iOS but doesn't appear to be on Android. There is an open issue on this here #205.

iOS interactions with other sound plugins, crash when listening or initializing, pauses

On iOS the speech recognition plugin can interact with other sound plugins, things like WebRTC, or sound playback or recording plugins. While this plugin tries hard to be a good citizen and properly share the various iOS sound resources there is always room for interactions. One thing that might help is to add a brief delay between the end of another sound plugin and starting to listen using SpeechToText. See this issue for example.

SDK version error trying to compile for Android

The speech_to_text plugin requires at least Android SDK 21 because some of the speech functions in Android were only introduced in that version. To fix this error you need to change the build.gradle entry to reflect this version. Here's what the relevant part of that file looked like as of this writing:

Recording audio on Android

It is not currently possible to record audio on Android while doing speech recognition. The only solution right now is to stop recording while the speech recognizer is active and then start again after.

Incorrect Swift version trying to compile for iOS

This happens when the Swift language version is not set correctly. See this thread for help https://github.com/csdcorp/speech_to_text/issues/45 .

Swift not supported trying to compile for iOS

This usually happens for older projects that only support Objective-C. See this thread for help https://github.com/csdcorp/speech_to_text/issues/88 .

Last word lost on Android

There's a discussion here https://github.com/csdcorp/speech_to_text/issues/434 about this known issue with some Android speech recognition. This issue is up to Google and other Android implementers to address, the plugin can't improve on their recognition quality.

Not working on a particular Android device

The symptom for this issue is that the initialize method will always fail. If you turn on debug logging using the debugLogging: true flag on the initialize method you'll see 'Speech recognition unavailable' in the Android log. There's a lengthy issue discussion here https://github.com/csdcorp/speech_to_text/issues/36 about this. The issue seems to be that the recognizer is not always automatically enabled on the device. Two key things helped resolve the issue in this case at least.

Not working on an Android emulator

The above tip about getting it working on an Android device is also useful for emulators. Some users have reported seeing another error on Android emulators - sdk gphone x86 (Pixel 3a API 30). AUDIO_RECORD perms were in the Manifest, and Mic perms were also manually set in Android Settings. When running the sample app, Initialize works, but Start fails and the log looks as follows.

Resolved by

Resolved it by opening Google, clicking the Mic icon and granting it perms; after that everything in the app works:

  • Go to Google Play
  • Search for 'Google'
  • You should find this app: https://play.google.com/store/apps/details?id=com.google.android.googlequicksearchbox If 'Disabled' enable it

This is the SO post that helped: https://stackoverflow.com/questions/28769320/how-to-check-wether-speech-recognition-is-available-or-not

Ensure the app has the required permissions. The symptom for this is that you get a permanent error notification `error_audio_error` when starting a listen session. Here's a Stack Overflow post that addresses it: https://stackoverflow.com/questions/46376193/android-speechrecognizer-audio-recording-error Here's the important excerpt:

You should go to system setting, Apps, Google app, then enable its permission of microphone.

User reported steps

From issue #298 this is the detailed set of steps that resolved their issue:

  • install google app
  • Settings > Voice > Languages - select the language
  • Settings > Voice > Languages > Offline speech recognition - install language
  • Settings > Language and region - select the Search language and Search region
  • Delete the build folder from the root path of the project and run again

iOS recognition guidelines

Apple has quite a good guide on the user experience for using speech, the original is here https://developer.apple.com/documentation/speech/sfspeechrecognizer This is the section that I think is particularly relevant:

Create a Great User Experience for Speech Recognition Here are some tips to consider when adding speech recognition support to your app.
Be prepared to handle failures caused by speech recognition limits. Because speech recognition is a network-based service, limits are enforced so that the service can remain freely available to all apps. Individual devices may be limited in the number of recognitions that can be performed per day, and each app may be throttled globally based on the number of requests it makes per day. If a recognition request fails quickly (within a second or two of starting), check to see if the recognition service became unavailable. If it is, you may want to ask users to try again later.
Plan for a one-minute limit on audio duration. Speech recognition places a relatively high burden on battery life and network usage. To minimize this burden, the framework stops speech recognition tasks that last longer than one minute. This limit is similar to the one for keyboard-related dictation.
Remind the user when your app is recording. For example, display a visual indicator and play sounds at the beginning and end of speech recognition to help users understand that they're being actively recorded. You can also display speech as it is being recognized so that users understand what your app is doing and see any mistakes made during the recognition process.
Do not perform speech recognition on private or sensitive information. Some speech is not appropriate for recognition. Don't send passwords, health or financial data, and other sensitive speech for recognition.

Using Google Cloud Speech-to-Text to transcribe your Twilio calls in real-time

Mark Shalda

Technical Program Manager & ML Partner Engineering Lead

Developers have asked us how they can use Google Cloud’s Speech-to-Text to transcribe speech (especially phone audio) coming from Twilio , a leading cloud communications PaaS. We’re pleased to announce that it’s now easier than ever to integrate live call data with Google Cloud’s Speech-to-Text using Twilio’s Media Streams.

The new TwiML <Stream> command streams call audio to a websocket server. This makes it simple to move your call audio from your business phone system into an AI platform that can transcribe that data in real time, use it for use cases like helping contact center agents and admins, and store it for later analysis.

When you combine this new functionality with Google Cloud’s Speech-to-Text abilities and other infrastructure and analytics tools like BigQuery, you can create an extremely scalable, reliable and accurate way of getting more value from your audio.

Architecture

The overall architecture for creating this flow looks something like what you see below. Twilio creates and manages the inbound phone number. Their new Stream command takes the audio from an incoming phone call and sends it to a configured websocket which runs on a simple App Engine flexible environment. From there, sending the audio along as it comes to Cloud Speech-to-Text is not very challenging. Once a transcript is created, it’s stored in BigQuery where real-time analysis can be performed.

https://storage.googleapis.com/gweb-cloudblog-publish/images/twilio_overall_architecture.max-1200x1200.png

Configuring your phone number

Once you’ve bought a number in Twilio, you’ll need to configure your phone number to respond with TwiML, which stands for Twilio Markup Language. It’s a tag-based language much like HTML, and control of the call is handed off to the TwiML instructions that your webhook provides.

Next, navigate to your list of phone numbers and choose your new number. On the number settings screen, scroll down to the Voice section. There is a field labelled “A Call Comes In”. Here, choose TwiML Bin from the drop-down and press the plus button next to the field to create a new TwiML Bin.

Creating a TwiML Bin

TwiML Bins are a serverless solution that can seamlessly host TwiML instructions. Using a TwiML Bin prevents you from needing to set up a webhook handler in your own web-hosted environment.

Give your TwiML Bin a Friendly Name that you can remember later. In the Body field, enter the following code, replacing the url attribute of the <Stream> tag and the phone number contained in the body of the <Dial> tag.
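As a rough sketch of what that TwiML can look like (the websocket URL and the dialed phone number below are placeholders, not values from this walkthrough):

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <!-- Placeholder URL: point this at your own websocket server -->
    <Stream url="wss://your-app-engine-project.appspot.com/" />
  </Start>
  <!-- Placeholder number: the phone the call should be connected to -->
  <Dial>+15551234567</Dial>
</Response>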

The <Stream> tag starts the audio stream asynchronously and then control moves onto the <Dial> verb. <Dial> will call that number. The audio stream will end when the call is completed.

Save your TwiML Bin and make sure that you see your Friendly Name in the “A Call Comes In“ drop down next to TwiML Bin. Make sure to Save your phone number.

Setup in Google Cloud

This setup can either be done in an existing Google Cloud project or a new project. To set up a new project, follow the instructions here . Once you have the project selected that you want to work in, you’ll need to set up a few key things before getting started:

Enable APIs for Google Speech-to-Text. You can do that by following the instructions here and searching for “Cloud Speech-to-Text API”.

Create a service account for your App Engine flexible environment to utilize when accessing other Google Cloud services. You’ll need to download the private key as a JSON file as well.

Add firewall rules to allow your App Engine flexible environment to accept incoming connections for the websocket. A command like the following should work from a gcloud enabled terminal:

gcloud compute firewall-rules create default-allow-websockets-8080 --allow tcp:8080 --target-tags websocket --description "Allow websocket traffic on port 8080"
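If you prefer to do the first two steps from the command line as well, roughly equivalent gcloud commands look like this (the service account name, key filename, and project ID are placeholders to adapt):

gcloud services enable speech.googleapis.com
gcloud iam service-accounts create twilio-transcriber
gcloud iam service-accounts keys create google_creds.json \
  --iam-account twilio-transcriber@YOUR_PROJECT_ID.iam.gserviceaccount.com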

App Engine flexible environment setup

For the App Engine application, we will be taking the sample code from Twilio’s repository to create a simple node.js websocket server. You can find the github page here with instructions on environment setup. Once the code is in your project folder, you’ll need to do a few more things to deploy your application:

Take the service account JSON key you downloaded earlier, rename it to “google_creds.json”, and place it in the same directory as the node.js code.

Create an app.yaml file that looks like the following:

runtime: nodejs
env: flex

manual_scaling:
  instances: 1

network:
  instance_tag: websocket

https://storage.googleapis.com/gweb-cloudblog-publish/images/App_Engine_flexible_environment_setup.max-400x400.png

Once these two items are in order, you will be able to deploy your application with the command:

gcloud app deploy

Once deployed, you can tail the console logs with the command:

gcloud app logs tail -s default

Verifying your stream is working

Call your Twilio number, and you should immediately be connected with the number specified in your TwiML. You should see a websocket connection request made to the url specified in the <Stream>. Your websocket should immediately start receiving messages. If you are tailing the logs in the console, the application will log the intermediate messages as well as any final utterances detected by Google Cloud’s Speech-to-Text API.
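To make the websocket side more concrete, here is a rough Python sketch of the same idea (the sample linked above is Node.js; the websockets and google-cloud-speech packages, the handler structure, and port 8080 are assumptions for illustration, not the original code). It accepts Twilio Media Stream messages, decodes the base64 mu-law audio, and pushes it into a Cloud Speech-to-Text streaming session:

# Illustrative sketch only - Twilio's official sample for this flow is Node.js.
import asyncio
import base64
import json
import queue
import threading

import websockets                    # assumed dependency: pip install websockets
from google.cloud import speech      # assumed dependency: pip install google-cloud-speech


def run_recognizer(audio_q):
    """Stream mu-law chunks from the queue into Cloud Speech-to-Text."""
    client = speech.SpeechClient()
    streaming_config = speech.StreamingRecognitionConfig(
        config=speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.MULAW,
            sample_rate_hertz=8000,   # Twilio Media Streams send 8 kHz mu-law audio
            language_code="en-US",
        ),
        interim_results=True,
    )
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk)
        for chunk in iter(audio_q.get, None)       # None marks the end of the call
    )
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            if result.is_final:
                print("utterance:", result.alternatives[0].transcript)


async def handle_call(websocket):
    """Handle one Twilio <Stream> connection (older websockets releases also pass a path argument)."""
    audio_q = queue.Queue()
    threading.Thread(target=run_recognizer, args=(audio_q,), daemon=True).start()
    async for message in websocket:
        event = json.loads(message)
        if event["event"] == "media":
            audio_q.put(base64.b64decode(event["media"]["payload"]))
        elif event["event"] == "stop":
            break
    audio_q.put(None)                 # tell the recognizer the call has ended


async def main():
    # Listen on port 8080, matching the firewall rule created earlier.
    async with websockets.serve(handle_call, "0.0.0.0", 8080):
        await asyncio.Future()        # run until the instance is stopped


if __name__ == "__main__":
    asyncio.run(main())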

Writing transcriptions to BigQuery

In order to analyze the transcripts later, we can create a BigQuery table and modify the sample code from Twilio to write to that table. Instructions for creating a new BigQuery table can be found here . Given the way Google Speech-to-Text creates transcription results, a potential schema for the table might look like the following.

https://storage.googleapis.com/gweb-cloudblog-publish/images/Writing_transcriptions_to_BigQuery.max-1200x1200.jpg

Once a table like this exists, you can modify the Twilio sample code to also stream data to the BigQuery table using sample code found  here .
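For illustration, streaming one finished utterance into such a table from Python might look roughly like this (the project, dataset, table, and column names here are hypothetical and should be adapted to whatever schema you actually created):

# Hypothetical example - adjust table_id and the row fields to your own schema.
from google.cloud import bigquery

bq = bigquery.Client()
table_id = "your-project.telephony.call_transcripts"   # placeholder table reference

def store_utterance(call_sid, transcript, confidence):
    rows = [{
        "call_sid": call_sid,
        "transcript": transcript,
        "confidence": confidence,
    }]
    errors = bq.insert_rows_json(table_id, rows)        # streaming insert
    if errors:
        print("BigQuery insert errors:", errors)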

Twilio’s new Stream function allows users to quickly make use of the real time audio that is moving through their phone systems. Paired with Google Cloud, that data can be transcribed in real time and passed on to numerous other applications. This ability to get high quality transcription in real time can benefit businesses—from helping contact center agents document and understand phone calls, to analyzing data from the transcripts of those calls. 

To learn more about Cloud Speech-to-Text,  visit our website .





COMMENTS

  1. GitHub

    Speech recognition bindings implemented for various programming languages like Python, Java, Node.JS, C#, C++, Rust, Go and others. Vosk supplies speech recognition for chatbots, smart home appliances, virtual assistants. It can also create subtitles for movies, transcription for lectures and interviews.

  2. Real-Time Whisper Voice Recognition with vosk model feedback

    Real-time usage scenarios (like a voice assistant, for example) require a GPU with at least 2-4 GB of VRAM. The more VRAM you have, the larger the model you can load; larger models transcribe better but run slower.

  3. The awesome speech recognition toolkit: Vosk!

    Vosk is a speech recognition toolkit supporting over 20 languages. The language model is 50MB light and easy to embed. So you will easily can do speech recognition completely offline. Vosk provides bindings for Python, Java, C#, and also Node.js! Supports 20+ languages and dialects. Works offline, even on lightweight devices - Raspberry Pi ...

  4. Offline Speech Recognition with Vosk

    Stage 5: Programming with Vosk and NLTK. Here comes the fun part! Let's code something in Python to identify speech and convert it to text, using Vosk-API as the backend. Make a new Python file (say s2c.py) in your project folder. Now the project folder directory structure should look like:

  5. [D] Some questions about Vosk speech to text : r/MachineLearning

    I'm pretty familiar with Vosk (which itself is just a wrapper around Kaldi). Its accuracy has been surpassed by newer models (see the K2 project), but if you need something small/fast (or you can only train your model on a small dataset), then it's difficult to beat. ... Dissecting BARK - what's inside SOTA Text-to-Speech.

  6. speech recognition

    Some Background on my project: I'm working on a Linguistic AI project. I needed a speech recognition engine to convert spoken words into text. I started using CMUSphinx. PocketSphinx to be more precise. I like pocketsphinx but I was told that it is obsolete and that vosk is much better. However, pocketsphinx is very easy to use in terms of ...

  7. Open Source AI Scribe / Auto-Transcriber / Speech-to-text

    The open-source video editor introduced a speech-to-text module in version 21.04 using VOSK, an offline speech-recognition API. That said, the feature is still pretty new and kind of buggy. It also involves having to download Python and knowing how to use Kdenlive.

  8. Voice Recognition Systems Compared: Google vs Yandex vs Vosk vs Sphinx

    2.3 Yandex Speech Kit. Yandex Speech Kit is a set of tools for offline speech recognition. It includes Qoldi as one of its components. Yandex Speech Kit offers two Russian models: a portable model with a size of 50 megabytes and a server model with a size of 3 gigabytes. The portable model is suitable for offline use on user devices, while the ...

  9. How To Make Offline Speech Recognition in Python Using Vosk

    Hello everyone, in this video I have told you about offline speech recognition. Vosk is an offline speech recognition toolkit that gives you offline spee...

  10. vosk-api vs vosk

    vosk-server - WebSocket, gRPC and WebRTC speech recognition server based on Vosk and Kaldi libraries. simple-keyboard. TTS - 🐾💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production. OpenBoard - 100% FOSS keyboard based on AOSP, with no dependency on Google binaries, that respects your privacy.

  11. Vosk Speech Recognition Toolkit

    Speech recognition bindings implemented for various programming languages like Python, Java, Node.JS, C#, C++ and others. Vosk supplies speech recognition for chatbots, smart home appliances, virtual assistants. It can also create subtitles for movies, transcription for lectures and interviews.

  12. Open Source Spracherkennung mit Vosk

    “Speech to Text” essentially means that an audio file is fed into a piece of software and a finished text file comes out at the end. In the meantime, using Vosk and the GitHub repository “recasepunc”, I have put together an open source solution for transcription incl.

  13. speech_to_text

    speech_to_text. A library that exposes device specific speech recognition capability. This plugin contains a set of classes that make it easy to use the speech recognition capabilities of the underlying platform in Flutter. It supports Android, iOS and web. The target use cases for this library are commands and short phrases, not continuous ...

  14. Nikse.dk

    Audio to text (speech recognition) via Whisper or Vosk/Kaldi; Auto Translation via Google translate; Rip subtitles from a (decrypted) dvd; Import and OCR VobSub sub/idx binary subtitles; Import and OCR Blu-ray .sup files (BD sup reading is based on Java code from BDSup2Sub by 0xdeadbeef); Can open subtitles embedded inside Matroska files

  15. Information

    Table 1 provides a sample from the test set. 3.2. Speech Recognition System. In this study, we evaluated the performance of various off-the-shelf speech recognition systems, namely Google speech-to-text API, VOSK API, QuartzNet, Wav2vec2.0, and CRDNN pre-trained model, on the MoroccanFrench corpus.

  16. Speech recognition save speech to wav too faster

    When I run this script on my computer, the library records the voice at normal speed. When I do it on OrangePI Zero 3, it records the voice at 1.5-2 times the normal speed. The only difference betw...

  17. Offline & Real-Time Speech-to-Text Running on Raspberry Pi Zero

    This demo shows Picovoice offline & real-time speech-to-text engine (Cheetah) running on Raspberry Pi Zero without an Internet connection.For more informatio...

  18. Yahor Talkachou on LinkedIn: Local, all-in-one Go speech-to-text

    Local continuous speech-to-text recognition with Go, Vosk, and gRPC streaming link.medium.com
