News
The submission deadline has been extended:
This extension aims to accommodate ongoing submissions, particularly for Track 2, and teams with pending Data Use Agreements (DUAs). We hope this provides additional time for participation and preparation.
The Ingestion time limit per submission has been increased:
The dev_streaming subset for latency measurement is shared through Box, and the corresponding code has been added to preprocess.sh.
Note: Approval typically takes ~2–4 weeks. Upon approval, access will be immediately granted to the SAP Research Release, which contains most of the same waveforms that will be part of the official competition release, but in a different data format. The official competition release will be made available on 2026-03-01 to early approvals, or immediately once approved after that date.
Introduction
Welcome to the Speech Accessibility Project Challenge 2 (SAPC2).
SAPC2 builds on the success of the Interspeech 2025 Speech Accessibility Project Challenge (Challenge API), which demonstrated significant progress in dysarthric speech recognition — reducing Word Error Rate (WER) from the Whisper-large-v2 baseline of 17.82% to 8.11%. This new edition introduces a larger, more diverse, and etiology-balanced corpus, further promoting fairness, robustness, and inclusivity in impaired-speech ASR. The challenge invites the research community to push the state of the art, develop innovative modeling techniques, and set new standards for accessible speech technology.
Challenge Tracks
The challenge features two complementary tracks:
- Unconstrained ASR Track: Participants may use models of any size or architecture, aiming to advance the state of the art in dysarthric speech recognition.
- Streaming ASR Track: Submitted systems will be placed on a Pareto chart of system latency and system accuracy, promoting lightweight and deployable solutions for real-world use.
Competitors will submit trained model parameters and inference code through Codabench (Track 1; Track 2) up to a maximum number of permitted submissions. Results on test1 will be released within three days of submission. Results on test2 will be released after the close of competition.
Evaluation Metrics
- Accuracy metrics (Track 1 & Track 2)
- Reference and hypothesis transcripts are normalized with a fully formatted text normalizer adapted from the HuggingFace ASR leaderboard.
- Character Error Rate (CER): primary metric, chosen for better correlation with human judgments and sensitivity to pronunciation variations in dysarthric speech.
- Word Error Rate (WER): secondary metric, reported for comparison with prior work and related literature.
- CER/WER are clipped to 100% at the utterance level. Scores are computed using two references (with and without disfluencies), and the lower error is selected per utterance.
- Latency metrics (Track 2 only)
- Latency is computed from streaming partial results on the streaming manifest (*_streaming.csv) and reported as median (P50, in ms).
- Reference implementation: compute_latency.py.
- Time To First Token (TTFT, P50, ms): first_non_empty_partial_time - (audio_send_start_time + mfa_speech_start).
- Time To Last Token (TTLT, P50, ms): final_visible_time - audio_end_oracle_time, where audio_end_oracle_time = audio_send_start_time + audio_duration_sec.
- For robustness analysis, P90 latency may also be reported in detailed outputs.
- For Pareto comparison, we use the average of TTFT and TTLT as latency; non-streaming ASR is assigned infinity.
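The per-utterance accuracy scoring described above (clip each error rate to 100%, then take the lower error across the two references) can be sketched in Python. This is an illustrative re-implementation, not the official scorer; text normalization is omitted, and all function names here are ours:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def clipped_error_rate(refs, hyp, unit="char"):
    """Per-utterance CER (unit='char') or WER (unit='word'):
    clip to 1.0 (100%) per reference, then keep the lower of the
    two references (e.g. with and without disfluencies)."""
    rates = []
    for ref in refs:
        r = list(ref) if unit == "char" else ref.split()
        h = list(hyp) if unit == "char" else hyp.split()
        rates.append(min(edit_distance(r, h) / max(len(r), 1), 1.0))
    return min(rates)
```

Corpus-level scores would then aggregate these per-utterance values over the test manifest.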
Prizes & Publication
A total prize of U.S. $10,000 will be divided equally among all teams with a system on the Pareto frontier of accuracy and latency, as measured using the sequestered test2 set.
To clarify how winners are selected across tracks:
- Track 1 (Unconstrained ASR): submissions are non-streaming systems and are ranked by recognition accuracy. For Pareto comparison, Track 1 latency is set to infinity. Exactly one non-streaming ASR system will win.
- Track 2 (Streaming ASR): submissions are ranked by the competition’s accuracy-latency criteria, and one or more streaming systems may win.
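The Pareto selection above can be illustrated with a short sketch. System names and numbers here are invented; latency is the average of TTFT and TTLT, with non-streaming (Track 1) systems assigned infinity, and a system is on the frontier if no other system is at least as good on both axes and strictly better on one:

```python
import math

def pareto_frontier(systems):
    """systems: dict mapping system name -> (latency_ms, cer_percent).
    Returns the names of non-dominated systems."""
    frontier = []
    for name, (lat, cer) in systems.items():
        dominated = any(
            l2 <= lat and c2 <= cer and (l2 < lat or c2 < cer)
            for other, (l2, c2) in systems.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical entries: two non-streaming (infinite latency), two streaming.
systems = {
    "A_track1": (math.inf, 4.0),          # best accuracy overall
    "B_track1": (math.inf, 5.0),          # dominated by A_track1
    "C_stream": ((120 + 300) / 2, 6.5),   # avg of TTFT=120, TTLT=300
    "D_stream": (150.0, 8.0),             # faster but less accurate
}
```

Here B_track1 is dominated (same infinite latency as A_track1, worse CER), so the frontier is A_track1 plus both streaming systems, matching the rule that exactly one non-streaming system wins while several streaming systems may.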
Teams submitting to the competition will be invited to present their work at a competition workshop, scheduled in conjunction with a major conference (TBA).
References
- [1] Hasegawa-Johnson, M., et al. Community-supported shared infrastructure in support of speech accessibility. JSLHR, 67(11), 4162–4175, 2024.
- [2] Zheng, X., et al. The Interspeech 2025 Speech Accessibility Project Challenge. Proc. Interspeech, 2025.
- [3] Gohider, N., et al. Towards Inclusive and Fair ASR: Insights from the SAPC Challenge for Optimizing Disordered Speech Recognition. Proc. Interspeech, 2025.
- [4] Ducorroy, A., et al. Robust fine-tuning of speech recognition models via model merging: application to disordered speech. Proc. Interspeech, 2025.
- [5] La Quatra, M., et al. Exploring Generative Error Correction for Dysarthric Speech Recognition. Proc. Interspeech, 2025.
- [6] Baumann, I., et al. Pathology-Aware Speech Encoding and Data Augmentation for Dysarthric Speech Recognition. Proc. Interspeech, 2025.
- [7] Wagner, D., et al. Personalized Fine-Tuning with Controllable Synthetic Speech from LLM-Generated Transcripts for Dysarthric Speech Recognition. Proc. Interspeech, 2025.
- [8] Wang, S., et al. A Self-Training Approach for Whisper to Enhance Long Dysarthric Speech Recognition. Proc. Interspeech, 2025.
- [9] Takahashi, K., et al. Fine-tuning Parakeet-TDT for Dysarthric Speech Recognition in the Speech Accessibility Project Challenge. Proc. Interspeech, 2025.
- [10] Tan, T., et al. CBA-Whisper: Curriculum Learning-Based AdaLoRA Fine-Tuning on Whisper for Low-Resource Dysarthric Speech Recognition. Proc. Interspeech, 2025.
- [11] Thennal, D.K., et al. Advocating Character Error Rate for Multilingual ASR Evaluation. Findings of ACL: NAACL 2025.
Acknowledgements
The Speech Accessibility Project is funded by a grant from the AI Accessibility Coalition. Computational resources for the challenge are provided by the National Center for Supercomputing Applications (NCSA). We would also like to thank Rob Kooper (NCSA), Wei Kang (Xiaomi Corp.), and Maisy Wieman (SoundHound AI) for their expertise and invaluable assistance in setting up the challenge.