GitHub - IS2AI/KSC2: KSC2++: Enhanced Kazakh Speech Corpus with LLM-Based Transcription Restoration

KSC2++: Enhanced Kazakh Speech Corpus with LLM-Based Transcription Restoration

Abstract

Automatic speech recognition (ASR) systems based on large language models (LLMs) have recently achieved strong performance across diverse speech recognition tasks. However, existing LLM-based ASR systems have been primarily developed for high-resource languages, while existing Kazakh ASR corpora provide only normalized transcriptions without punctuation and capitalization. To address this limitation, we introduce KSC2++, an unnormalized extension of the KSC2 corpus containing 1,151 hours of speech and 627,382 utterances with restored punctuation, capitalization, and named entities using LLM-based transcription unnormalization. We fine-tune Qwen3-ASR models on KSC2++ and demonstrate substantial improvements in punctuation-aware and named-entity-aware ASR under realistic evaluation settings. Experimental results show state-of-the-art performance on multiple Kazakh ASR benchmarks, including KSC2, KSD, and KazNERD. Finally, we publicly release the KSC2++ corpus, pretrained models, and source code to support reproducible research in low-resource LLM-based ASR.
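To make the distinction between normalized and unnormalized transcriptions concrete, here is a minimal sketch of the kind of normalization that removes capitalization and punctuation from a transcript. The exact normalization rules used for KSC2 are not stated here, so the function name `normalize_transcript` and its rules are assumptions made purely for illustration; they are not the authors' pipeline.

```python
import unicodedata


def normalize_transcript(text: str) -> str:
    """Illustrative normalization (an assumption, not the KSC2 procedure).

    Lowercases the text, replaces punctuation and symbol characters with
    spaces, and collapses runs of whitespace into single spaces.
    """
    kept = []
    for ch in text.lower():
        category = unicodedata.category(ch)
        # Unicode categories starting with "P" are punctuation, "S" are symbols.
        if category.startswith("P") or category.startswith("S"):
            kept.append(" ")
        else:
            kept.append(ch)
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    # A made-up example sentence, not taken from the corpus.
    print(normalize_transcript("Сәлем, Астана!"))
```

An unnormalized corpus such as KSC2++ keeps the left-hand form (capitalization, punctuation, named entities), whereas a normalized corpus would store only something like the output of this function.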

The dataset, source code, and pretrained models will be released soon :)
