Summary

Modern Apple devices can run AI-based speech transcription on-device. This guide presents two options; it is not a comprehensive overview of all available tools.

Speech Recognition

We look at two CLI tools: yap and whisper.cpp.

Yap

Yap is a CLI tool for on-device speech transcription.

Installation is simple with Homebrew:

brew install yap

Basic usage is equally straightforward:

yap audio.wav

Set the locale if the audio's language differs from the region's default.

yap transcribe --locale=en_US --json audio.mp3 -o audio.json

To make it work with other languages, the locale can be set to another value, but the language first needs to be enabled in the "Language & Region" settings. This option is hidden behind the "Translation languages…" button at the bottom.

Now we can transcribe, for example, a German voice recording.

yap transcribe --locale=de_DE german.wav

This is fast and uses the on-device model and hardware acceleration. A small disadvantage is that there is no control over the granularity of the timestamps: options like --srt produce time ranges for groups of words, not for individual words.
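As a sketch, such subtitle output can be requested like this (assuming --srt combines with the locale and output flags shown above):

```shell
# Write an SRT subtitle file with one time range per group of words.
yap transcribe --locale=de_DE --srt german.wav -o german.srt
```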

Whisper.cpp

Whisper.cpp is an inference tool for automatic speech recognition based on OpenAI's Whisper models.

On newer Apple silicon (M-series) machines, whisper.cpp should be built from source in order to enable Core ML support.

gh repo clone ggml-org/whisper.cpp
cd whisper.cpp
cmake -B build -DWHISPER_COREML=1
cmake --build build -j --config Release

A model is also needed, for example ggml-large-v3-turbo-q5_0. With Core ML enabled, this requires both the bin file and the mlmodelc archive; both can be obtained from Hugging Face.

We can put both into the models folder of the whisper.cpp project.

mv path/to/ggml-large-v3-turbo-q5_0.bin models/
unzip -d models/ path/to/ggml-large-v3-turbo-encoder.mlmodelc.zip
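Alternatively, the quantized bin file can be fetched with the download script that ships in the whisper.cpp repository; it places the file directly into models/ (the Core ML archive still needs to be downloaded separately):

```shell
# Download ggml-large-v3-turbo-q5_0.bin into models/ using the bundled script.
sh ./models/download-ggml-model.sh large-v3-turbo-q5_0
```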

Usage:

./build/bin/whisper-cli --model models/ggml-large-v3-turbo-q5_0.bin \
  --file audio.mp3 --output-json --output-file audio

Unlike with yap, we can now control the segment length with --max-len N; --max-len 1 yields roughly one word per timestamped segment.

./build/bin/whisper-cli --model models/ggml-large-v3-turbo-q5_0.bin \
  --file audio.mp3 --output-json --output-file audio --max-len 1
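The resulting JSON can then be post-processed, for example with jq (a sketch, assuming jq is installed and the JSON has whisper.cpp's layout, where a transcription array holds per-segment millisecond offsets and text):

```shell
# Print start offset, end offset (both ms), and text for each segment.
jq -r '.transcription[] | "\(.offsets.from) \(.offsets.to)\(.text)"' audio.json
```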

This runs slower than yap, but it offers a few more features which might be relevant.

Conclusion

If only the text is needed, use yap. If a more detailed analysis of the audio is required, the extra setup of whisper.cpp can be justified.