Mirakram Aghalarov, Mahammad Mehdi, Javidan Zeynalov, Sabuhi Aghayev
Pichilti – monolingual Azerbaijani distilled Whisper model
The recent saturation in the development of Speech-to-Text (STT) models has been disrupted by the release of multilingual models that generalize well in zero-shot settings. While these models offer powerful and robust voice processing in noisy environments, their multilingual coverage comes at the cost of increased GPU utilization. Moreover, smaller models perform poorly on low-resource languages such as Azerbaijani. This paper introduces a methodology and a large-scale speech dataset for training STT models in Azerbaijani. Over 500 hours of speech data have been collected, and knowledge distillation techniques have been applied at several levels. As a result, the distilled Whisper variant (Pichilti-base) outperforms Whisper-large-v3 on Azerbaijani speech recognition. Additionally, dedicated post-processing methods have been implemented to mitigate hallucinations on silent recordings.
Keywords: Knowledge Distillation, Speech-to-Text, Voice Processing, Low-Resource Languages
DOI: https://doi.org/10.54381/icp.2026.1.07
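
To make the distillation objective named in the abstract concrete, below is a minimal sketch of the token-level knowledge-distillation loss commonly used in distil-Whisper-style training: a weighted sum of cross-entropy on reference transcripts and KL divergence against the teacher's softened output distribution. The function name, tensor shapes, and the alpha/temperature values are illustrative assumptions, not the authors' reported configuration.

# Illustrative sketch only; names and hyperparameters are assumptions,
# not the Pichilti training setup.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.8, temperature=2.0, pad_token_id=-100):
    """Weighted sum of supervised cross-entropy and a soft-target KL term."""
    # Standard supervised loss against the reference transcript,
    # ignoring padding positions in the label sequence.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=pad_token_id,
    )
    # Soft-target loss: the student mimics the teacher's
    # temperature-softened distribution over the vocabulary.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2  # rescale gradients after temperature softening
    return alpha * kl + (1.0 - alpha) * ce

In distil-Whisper-style setups the KL term is typically weighted heavily, and teacher pseudo-labels are often substituted for reference transcripts when human transcriptions are noisy or unavailable; whether Pichilti follows either convention is not stated in the abstract.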