The base ggml-medium.bin file is roughly 1.5 GB. However, developers frequently use quantized versions (such as ggml-medium-q5_0.bin or ggml-medium-q8_0.bin), which reduce the storage size to roughly 540MB–830MB and speed up inference with marginal quality loss.

How Does ggml-medium.bin Work Under the Hood?
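The quantized sizes quoted above can be roughly reproduced from the block layout of the ggml formats. The sketch below is a back-of-the-envelope estimate, not a reading of the actual file: it assumes about 769 million parameters for Whisper medium and assumes every weight is stored in the chosen format (real files keep some tensors at higher precision and add metadata, so they come out slightly larger). The function and constant names are illustrative only.

```python
# Bytes used by one block of 32 weights in each ggml storage format.
BLOCK_SIZE = 32
BYTES_PER_BLOCK = {
    "f16": 2 * BLOCK_SIZE,  # 2 bytes per weight, no per-block scale
    "q8_0": 2 + 32,         # fp16 scale + 32 int8 values
    "q5_0": 2 + 4 + 16,     # fp16 scale + 32 high bits + 32 low nibbles
}

# Assumption: approximate parameter count of the Whisper medium model.
WHISPER_MEDIUM_PARAMS = 769_000_000


def estimate_size_mb(n_params: int, fmt: str) -> float:
    """Rough size in MB (10^6 bytes), assuming all weights use `fmt`."""
    n_blocks = n_params / BLOCK_SIZE
    return n_blocks * BYTES_PER_BLOCK[fmt] / 1e6


if __name__ == "__main__":
    for fmt in ("f16", "q8_0", "q5_0"):
        size = estimate_size_mb(WHISPER_MEDIUM_PARAMS, fmt)
        print(f"{fmt}: ~{size:.0f} MB")
```

Under these assumptions the estimate works out to about 5.5 bits per weight for Q5_0 and 8.5 bits per weight for Q8_0, which lands close to the 540MB–830MB range mentioned above.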
If transcription takes longer than the audio clip itself, your CPU may be bottlenecked by memory bandwidth. Consider switching to the Q5_0 quantized model, which is smaller and therefore moves less data through memory on each inference pass.
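One way to make "longer than the actual duration" concrete is the real-time factor: processing time divided by audio duration. The helper below is a minimal sketch with hypothetical function names; it does not call whisper.cpp, and the timings are assumed to come from your own measurements.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_slower_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True when transcription took longer than the clip itself."""
    return real_time_factor(processing_seconds, audio_seconds) > 1.0


if __name__ == "__main__":
    # Example: a 60-second clip that took 90 seconds to transcribe.
    rtf = real_time_factor(90.0, 60.0)
    print(f"real-time factor: {rtf:.2f}")
    print("try a quantized model" if rtf > 1.0 else "keeping up with real time")
```

A factor above 1.0 means the machine is falling behind the audio, which is the situation where trying a smaller quantized model makes sense.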
It uses the GGML tensor library format, designed for efficient inference on a wide range of platforms (macOS, iOS, Android, Linux, Windows).
When running a "medium" sized model (roughly 769M parameters in Whisper's case), memory bandwidth is the bottleneck, not the math itself.
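The bandwidth argument can be illustrated with a simple lower bound: if one inference pass has to stream every weight from RAM once, it cannot finish faster than the model size divided by the memory bandwidth. The sketch below rests on that simplifying assumption; the 50 GB/s bandwidth and the model sizes are illustrative values, not measurements, and the function name is hypothetical.

```python
def min_seconds_per_pass(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one pass that reads every weight once from memory."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return model_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    ASSUMED_BANDWIDTH = 50e9  # assumption: 50 GB/s of usable memory bandwidth
    ASSUMED_SIZES = {
        "medium (f16)": 1.5e9,    # assumed size in bytes
        "medium (q5_0)": 0.54e9,  # assumed size in bytes
    }
    for name, size in ASSUMED_SIZES.items():
        ms = min_seconds_per_pass(size, ASSUMED_BANDWIDTH) * 1000
        print(f"{name}: at least {ms:.1f} ms per pass")
```

Because the bound scales linearly with file size, shrinking the model through quantization lowers the floor on per-pass latency even when the CPU's arithmetic throughput is unchanged.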