AI Datasets
Five open datasets. 461,600 clinical sentences. 184 hours of synthetic medical speech. All DOI-registered, all available to the global medical AI research community.
Curated by physicians, vetted for clinical accuracy.
Five medical AI datasets
All DOI-registered. CC-BY-NC-4.0 (research-friendly).
Each dataset is gated. You'll need a Hugging Face account, and approval is manual (usually within a day). Click any card to request access.
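Once a request is approved, loading a gated dataset could look like the sketch below. It assumes the Hugging Face `datasets` library is installed, that an access token from the approved account is exported as `HF_TOKEN`, and that the dataset exposes a `train` split. The repo id in the usage comment is a placeholder, not a real dataset name, and both helper names are hypothetical.

```python
import os


def read_hf_token(env=None):
    """Return the Hugging Face access token from HF_TOKEN, or None if unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


def load_gated_split(repo_id, split="train", env=None):
    """Load one split of a gated dataset after access has been approved.

    `repo_id` is whatever id appears on the dataset card. The split name
    "train" is an assumption; check the card for the actual split names.
    """
    token = read_hf_token(env)
    if token is None:
        raise RuntimeError(
            "Set HF_TOKEN to a token from an account with approved access."
        )
    # Imported lazily so the token check above runs without the library.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, token=token)


# Usage (placeholder repo id; substitute the id shown on the dataset card):
# ds = load_gated_split("your-org/your-dataset", split="train")
```

If approval is still pending, the Hub rejects the download even with a valid token, so a failure here usually means waiting for approval rather than changing the code.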
Synthetic medical speech for fine-tuning Whisper-based ASR. 19 voices across 3 English accent groups. Sourced from RxNorm and the FDA.
Synthetic nursing-specific clinical documentation sentences. 70/15/15 stratified split for ML training.
Synthetic physician-specific clinical documentation sentences for training medical ASR and language models.
Synthetic general medical terminology sentences for broad clinical ASR use across specialties.
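For readers unfamiliar with the 70/15/15 stratified split mentioned for the nursing dataset, here is a minimal sketch of the general technique. It is an illustration only, not the procedure used to build the dataset: the `category` field, the train/validation/test ordering, and the function name are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, key, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split records three ways so each stratum keeps roughly the same ratios."""
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[key(rec)].append(rec)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        # Whatever remains goes to test, so no record is dropped.
        test.extend(group[n_train + n_val:])
    return train, val, test


# Toy data: 60 sentences in a hypothetical category "A", 40 in "B".
records = [
    {"text": f"sentence {i}", "category": "A" if i < 60 else "B"}
    for i in range(100)
]
train, val, test = stratified_split(records, key=lambda r: r["category"])
print(len(train), len(val), len(test))  # 70 15 15
```

Splitting inside each stratum, rather than across the whole pool, keeps rare categories represented in the validation and test sets in the same proportion as in training.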
Legacy / reference
Original 24 kHz dataset, superseded by v2. Kept for reference.
License and commercial use
All datasets are released under CC-BY-NC-4.0. Free for academic and research use, with attribution.
Commercial deployment needs a separate license. Talk to us about commercial access.
For research purposes only. Not a medical device. Not intended for clinical decision-making.