AI Datasets

Five open datasets. 461,600 clinical sentences. 184 hours of synthetic medical speech. All DOI-registered, all available to the global medical AI research community.

Curated by physicians, vetted for clinical accuracy.

5

Open datasets

461,600

Clinical sentences

184 hours

Synthetic medical speech

DOI-registered

Citable artifacts

Open Datasets

Five medical AI datasets

All DOI-registered and released under CC-BY-NC-4.0 (free for non-commercial research use, with attribution).

Each dataset is gated. You'll need a Hugging Face account, and approval is manual (usually within a day). Click any card to request access.
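Once a request is approved, a dataset can be loaded programmatically. The following is a minimal sketch, not an official loader: it assumes the Hugging Face `datasets` library is installed and that you have already logged in with `huggingface-cli login`. The namespace `your-org` is a placeholder, since the organization name is not stated here, and the split name `train` is an assumption based on the Train / Val / Test figures on the cards.

```python
"""Sketch: load one gated dataset after access has been approved."""

# Hypothetical namespace: replace with the organization shown on the dataset page.
HF_NAMESPACE = "your-org"


def repo_id(dataset_name: str, namespace: str = HF_NAMESPACE) -> str:
    """Build a Hub repo id of the form '<namespace>/<dataset_name>'."""
    return f"{namespace}/{dataset_name}"


def load_gated_split(dataset_name: str, split: str = "train"):
    """Load one split of a gated dataset using the locally stored token.

    Requires `pip install datasets` and a prior `huggingface-cli login`
    with an account whose access request has been approved.
    """
    # Imported lazily so repo_id() works even without the library installed.
    from datasets import load_dataset

    # token=True asks the library to use the token saved by the login step.
    return load_dataset(repo_id(dataset_name), split=split, token=True)
```

With access granted, `load_gated_split("nursing-sentences-1")` would return the training split. Before approval, the same call is expected to fail with an authorization error.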

CC-BY-NC-4.0

medical-tts-parquet-2-16khz

Synthetic medical speech for fine-tuning Whisper-based ASR. 19 voices across 3 English accent groups. Terminology sourced from RxNorm and the FDA.

101,475

Audio-text pairs

184.1 hrs

Duration

16 kHz mono

Sample rate

DOI:10.57967/hf/8264

Request Access on Hugging Face

CC-BY-NC-4.0

nursing-sentences-1

Synthetic nursing-specific clinical documentation sentences. Stratified 70/15/15 train/validation/test split.

40,247

Total rows

28K

Train

6K / 6K

Val / Test

DOI:10.57967/hf/8425

Request Access on Hugging Face

CC-BY-NC-4.0

physician-sentences-1

Synthetic physician-specific clinical documentation sentences for training medical ASR and language models.

107,906

Total rows

75K

Train

16K / 16K

Val / Test

DOI:10.57967/hf/8426

Request Access on Hugging Face

CC-BY-NC-4.0

general-medical-sentences-1

Synthetic general medical terminology sentences for broad clinical ASR use across specialties.

313,447

Total rows

219K

Train

47K / 47K

Val / Test

DOI:10.57967/hf/8424

Request Access on Hugging Face
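The rounded Train and Val / Test figures on the three sentence cards are consistent with a 70/15/15 split of each row total. The sketch below reproduces that arithmetic. The ratio is stated only for nursing-sentences-1, so applying it to the other two datasets is an assumption, and the published splits may differ by a few rows. Stratification itself is not modeled here.

```python
def split_sizes(total_rows: int, train_pct: int = 70, val_pct: int = 15):
    """Return (train, val, test) row counts for a percentage split.

    Integer arithmetic avoids floating-point rounding; the test split
    takes whatever rows remain so the three parts always sum to the total.
    """
    train = total_rows * train_pct // 100
    val = total_rows * val_pct // 100
    test = total_rows - train - val
    return train, val, test


# Row totals as listed on the dataset cards.
TOTAL_ROWS = {
    "nursing-sentences-1": 40_247,
    "physician-sentences-1": 107_906,
    "general-medical-sentences-1": 313_447,
}

for name, total in TOTAL_ROWS.items():
    train, val, test = split_sizes(total)
    print(f"{name}: train={train:,} val={val:,} test={test:,}")
```

The three row totals also sum to the 461,600 clinical sentences quoted at the top of the page.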

Legacy / reference

Deprecated

medical-tts-parquet-1

Original 24 kHz dataset, superseded by medical-tts-parquet-2-16khz. Kept for reference.

Deprecated

Status

DOI:10.57967/hf/8249

View on Hugging Face

License and commercial use

All datasets are released under CC-BY-NC-4.0. Free for academic and research use, with attribution.

Commercial deployment needs a separate license. Talk to us about commercial access.
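For attribution, the DOIs on the cards can be turned into resolvable links through the public doi.org resolver. The sketch below is illustrative only: the `attribution_line` format is a hypothetical placeholder, not a citation style prescribed for these datasets, and it lists no authors because none are named here.

```python
# DOIs as listed on the dataset cards.
DATASET_DOIS = {
    "medical-tts-parquet-2-16khz": "10.57967/hf/8264",
    "nursing-sentences-1": "10.57967/hf/8425",
    "physician-sentences-1": "10.57967/hf/8426",
    "general-medical-sentences-1": "10.57967/hf/8424",
    "medical-tts-parquet-1": "10.57967/hf/8249",
}


def doi_url(doi: str) -> str:
    """Return the resolver link for a DOI via the public doi.org proxy."""
    return f"https://doi.org/{doi}"


def attribution_line(dataset_name: str) -> str:
    """Format a minimal attribution line (hypothetical format)."""
    doi = DATASET_DOIS[dataset_name]
    return f"{dataset_name}, CC-BY-NC-4.0, {doi_url(doi)}"


print(attribution_line("nursing-sentences-1"))
```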

For research purposes only. Not a medical device. Not intended for clinical decision-making.