Abstract - WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language, into English. The significant challenge we face is the lack of large-scale training data: in fact, we have only 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (such as English to French), this practice is no longer viable in the Wardaman-to-English context. To tackle the low-resource challenge, we design WARDEN to have separate transcription and translation models: WARDEN first turns a Wardaman audio input into a phonemic transcription, and then turns the transcription into an English translation. Further, we propose two useful techniques to enhance performance. For transcription, we initialize the Wardaman token from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations and provide this domain-specific knowledge to a large language model (LLM), which reasons over it and decides the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low-data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.
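The two-stage design described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`lookup_glosses`, `build_prompt`, `translate_audio`), the prompt wording, and the whitespace-based word matching are all hypothetical, and the transcription model and LLM are passed in as plain callables because the abstract does not say which models are used.

```python
from typing import Callable, Dict, List


def lookup_glosses(transcription: str, dictionary: Dict[str, str]) -> List[str]:
    """Collect 'word = gloss' lines for transcribed words found in the dictionary.

    Assumption: words are separated by whitespace and matched exactly
    (after lowercasing); a real system may need fuzzier matching.
    """
    lines: List[str] = []
    seen = set()
    for word in transcription.lower().split():
        if word in dictionary and word not in seen:
            lines.append(f"{word} = {dictionary[word]}")
            seen.add(word)
    return lines


def build_prompt(transcription: str, dictionary: Dict[str, str]) -> str:
    """Build a dictionary-augmented prompt (hypothetical wording)."""
    glosses = lookup_glosses(transcription, dictionary)
    gloss_block = "\n".join(glosses) if glosses else "(no dictionary matches)"
    return (
        "Translate the following phonemic transcription into English.\n"
        f"Transcription: {transcription}\n"
        "Dictionary entries:\n"
        f"{gloss_block}\n"
        "English translation:"
    )


def translate_audio(
    audio: object,
    transcribe: Callable[[object], str],
    llm: Callable[[str], str],
    dictionary: Dict[str, str],
) -> str:
    """Stage 1: audio -> phonemic transcription. Stage 2: transcription -> English."""
    transcription = transcribe(audio)
    return llm(build_prompt(transcription, dictionary))
```

Keeping the two stages behind separate callables mirrors the separation the abstract argues for: the transcription model can be fine-tuned on the small audio set, while the translation step relies on the dictionary and the LLM rather than on paired training data.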
WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
1️⃣ One-sentence summary
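This paper presents WARDEN, a system that separates speech transcription from text translation and combines cross-lingual initialization of the audio model with an expert dictionary that assists a large language model. With only 6 hours of annotated data, it transcribes speech in Wardaman, an endangered Australian Indigenous language, and translates it into English, outperforming unified models that depend on large amounts of data.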
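The cross-lingual initialization idea can be illustrated with a small sketch: a new language token starts from a copy of a related language's embedding instead of from random values. This is only an assumption about how such an initialization might look. The abstract does not describe the model architecture, so the embedding table is represented as a plain list of rows, and `add_token_from_donor` and `donor_id` are hypothetical names.

```python
from typing import List


def add_token_from_donor(
    embeddings: List[List[float]], donor_id: int
) -> List[List[float]]:
    """Return a new embedding table with one extra row copied from the donor token.

    Assumption: each token id indexes one row of the table, and the new
    token is appended at the end (its id is the old table length).
    """
    if not 0 <= donor_id < len(embeddings):
        raise IndexError("donor_id is out of range")
    # Copy every row so the original table is left untouched.
    extended = [row[:] for row in embeddings]
    # The new token starts from the donor's values and is then fine-tuned.
    extended.append(embeddings[donor_id][:])
    return extended
```

For example, if the donor row stood for the Sundanese language token, the appended row would play the role of the new Wardaman token. Starting from a phonetically similar language gives fine-tuning a closer starting point than random initialization, which is the motivation stated in the abstract.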