Apple macOS Speech Recognition Tools
Summary
Modern Apple devices can run AI-based speech transcription on-device. This guide presents two options; it is not an exhaustive overview of everything available.
Speech Recognition
We look at two CLI tools: yap and whisper.cpp.
Yap
Yap is a CLI tool for on-device speech transcription on macOS.
Installation is simple with Homebrew:
brew install yap
Basic usage is equally straightforward:
yap audio.wav
Set the locale explicitly if the audio language differs from the default of the system region:
yap transcribe --locale=en_US --json audio.mp3 -o audio.json
To transcribe other languages, pass a different locale value, but the language first needs to be enabled in the "Language & Region" settings. This option is hidden behind the "Translation languages…" button at the bottom.
We can then transcribe, for example, a German voice recording:
yap transcribe --locale=de_DE german.wav
This runs very fast, since it uses the on-device model and hardware acceleration.
A small disadvantage is that there is no control over the granularity of the
timing: options like --srt yield time ranges for groups of words, not for
individual words.
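The SRT output can still be post-processed. A minimal sketch of parsing such group-level time ranges (the sample text is illustrative, not actual yap output):

```python
import re

def parse_srt(srt: str) -> list[tuple[str, str, str]]:
    """Parse SRT text into (start, end, text) triples."""
    entries = []
    # SRT blocks are separated by blank lines: index, time range, text.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, end = [t.strip() for t in lines[1].split("-->")]
        entries.append((start, end, " ".join(lines[2:])))
    return entries

# Illustrative sample in standard SRT form.
sample = """\
1
00:00:00,000 --> 00:00:02,500
Hello and welcome

2
00:00:02,500 --> 00:00:04,000
to this recording
"""
print(parse_srt(sample))
```

Each triple covers a whole word group, which is the finest resolution yap exposes here.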
Whisper.cpp
Whisper.cpp is a C/C++ implementation of automatic speech recognition inference using OpenAI's Whisper models.
On newer Apple silicon devices, it should be built from source in order to enable Core ML support:
gh repo clone ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release
A model is also needed, for example ggml-large-v3-turbo-q5_0. With Core ML, this requires both the bin file and the mlmodelc archive, which can be obtained from Hugging Face.
Put both into the models folder of the whisper.cpp project:
mv path/to/ggml-large-v3-turbo-q5_0.bin models/
unzip -d models/ path/to/ggml-large-v3-turbo-encoder.mlmodelc.zip
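Before running inference, it can be worth verifying that both artifacts ended up in the right place. A small sketch, assuming the filenames from the setup above:

```python
from pathlib import Path

def missing_model_files(model_dir: str) -> list[str]:
    """Return the names of model artifacts not yet present in model_dir.

    The filenames match the ggml-large-v3-turbo-q5_0 setup above;
    adjust them when using another model.
    """
    required = [
        "ggml-large-v3-turbo-q5_0.bin",          # ggml weights
        "ggml-large-v3-turbo-encoder.mlmodelc",  # Core ML encoder
    ]
    base = Path(model_dir)
    return [name for name in required if not (base / name).exists()]

print(missing_model_files("models"))
```

An empty list means both files are in place; otherwise the printed names show what is still missing.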
Usage:
build/bin/whisper-cli --model models/ggml-large-v3-turbo-q5_0.bin \
--file audio.mp3 --output-json --output-file audio
Unlike with yap, the segment length can be controlled with --max-len N; with --max-len 1 each segment holds a single word:
build/bin/whisper-cli --model models/ggml-large-v3-turbo-q5_0.bin \
--file audio.mp3 --output-json --output-file audio --max-len 1
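The resulting JSON can then be turned into per-word timings. The sample below is a hand-written stand-in for such output, so treat the exact schema (a transcription array with millisecond offsets per segment) as an assumption to check against your own file:

```python
import json

# Hand-written stand-in for whisper.cpp --output-json output produced with
# --max-len 1; the schema here is an assumption, not guaranteed output.
sample = json.loads("""
{
  "transcription": [
    {"offsets": {"from": 0,   "to": 320}, "text": " Hello"},
    {"offsets": {"from": 320, "to": 700}, "text": " world"}
  ]
}
""")

def word_timings(doc: dict) -> list[tuple[float, float, str]]:
    """Extract (start_s, end_s, word) triples from the transcription array."""
    return [
        (seg["offsets"]["from"] / 1000, seg["offsets"]["to"] / 1000,
         seg["text"].strip())
        for seg in doc["transcription"]
    ]

print(word_timings(sample))
```

This per-word resolution is exactly what yap's --srt output does not provide.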
This runs slower than yap, but it offers more features, which may be relevant depending on the task.
Conclusion
If only the text is needed, use yap. If a more detailed analysis of the audio is required, setting up whisper.cpp may be worth the effort.